Abstract

We obtain a tight distribution-specific characterization of the sample complexity of large-margin 
classification with L2 regularization: We introduce the margin-adapted dimension, which is a sim- 
ple function of the second order statistics of the data distribution, and show distribution-specific 
upper and lower bounds on the sample complexity, both governed by the margin-adapted dimen- 
sion of the data distribution. The upper bounds are universal, and the lower bounds hold for the rich 
family of sub-Gaussian distributions with independent features. We conclude that this new quantity 
tightly characterizes the true sample complexity of large-margin classification. To prove the lower 
bound, we develop several new tools of independent interest. These include new connections between
shattering and hardness of learning, new properties of shattering with linear classifiers, and a
new lower bound on the smallest eigenvalue of a random Gram matrix generated by sub-Gaussian 
variables. Our results can be used to quantitatively compare large margin learning to other learning 
rules, and to improve the effectiveness of methods that use sample complexity bounds, such as 
active learning. 

Keywords: supervised learning, sample complexity, linear classifiers, distribution-dependence 



1. Introduction 

In this paper we pursue a tight characterization of the sample complexity of learning a classifier, 
under a particular data distribution, and using a particular learning rule. 

Most learning theory work focuses on providing sample-complexity upper bounds which hold 
for a large class of distributions. For instance, standard distribution-free VC-dimension analysis 
shows that if one uses the Empirical Risk Minimization (ERM) learning rule, then the sample complexity
of learning a classifier from a hypothesis class with VC-dimension $d$ is at most $O(d/\epsilon^2)$, where
$\epsilon$ is the maximal excess classification error (Vapnik and Chervonenkis, 1971; Anthony and Bartlett,
1999). Such upper bounds can be useful for understanding the positive aspects of a learning rule.



However, it is difficult to understand the deficiencies of a learning rule, or to compare between 
different rules, based on upper bounds alone. This is because it is possible, and is often the case,
that the actual number of samples required to get a low error, for a given data distribution using a 
given learning rule, is much lower than the sample-complexity upper bound. As a simple example, 
suppose that the support of a given distribution is restricted to a subset of the domain. If the VC- 
dimension of the hypothesis class, when restricted to this subset, is smaller than d, then learning 
with respect to this distribution will require fewer examples than the upper bound predicts.

Of course, some sample complexity upper bounds are known to be tight or to have
an almost-matching lower bound. For instance, the VC-dimension upper bound is tight
(Vapnik and Chervonenkis, 1974). This means that there exists some data distribution in the class
covered by the upper bound, for which this bound cannot be improved. Such a tightness result 
shows that there cannot be a better upper bound that holds for this entire class of distributions. But 
it does not imply that the upper bound characterizes the true sample complexity for every specific 
distribution in the class. 

The goal of this paper is to identify a simple quantity, which is a function of the distribution, 
that does precisely characterize the sample complexity of learning this distribution under a specific 
learning rule. We focus on the important hypothesis class of linear classifiers, and on the popular 
rule of margin-error-minimization (MEM). Under this learning rule, a learner must always select a 
linear classifier that minimizes the margin-error on the input sample. 

The VC-dimension of the class of homogeneous linear classifiers in $\mathbb{R}^d$ is $d$ (Dudley, 1978).
This implies a sample complexity upper bound of $O(d/\epsilon^2)$ using any MEM algorithm, where $\epsilon$ is
the excess error relative to the optimal margin error (see footnote 1). We also have that the sample
complexity of any MEM algorithm is at most $O(B^2/(\gamma^2\epsilon^2))$, where $B^2$ is the average squared
norm of the data and $\gamma$ is the size of the margin (Bartlett and Mendelson, 2002). Both of these upper
bounds are tight. For instance, there exists a distribution with an average squared norm of $B^2$ that
requires as many as $C \cdot B^2/(\gamma^2\epsilon^2)$ examples to learn, for some universal constant $C$
(see e.g. Anthony and Bartlett, 1999).



However, the VC-dimension upper bound indicates, for instance, that if a distribution induces a large 
average norm but is supported by a low-dimensional sub-space, then the true number of examples 
required to reach a low error is much smaller. Thus, neither of these upper bounds fully describes 
the sample complexity of MEM for a specific distribution. 

We obtain a tight distribution-specific characterization of the sample complexity of large-margin 
learning for a rich class of distributions. We present a new quantity, termed the margin-adapted 
dimension, and use it to provide a tighter distribution-dependent upper bound, and a matching 
distribution-dependent lower bound for MEM. The upper bound is universal, and the lower bound 
holds for a rich class of distributions with independent features. 

The margin-adapted dimension refines both the dimension and the average norm of the data
distribution, and can be easily calculated from the covariance matrix and the mean of the distribution.
We denote this quantity, for a margin of $\gamma$, by $k_\gamma$. Our sample-complexity upper bound shows
that $\tilde{O}(k_\gamma/\epsilon^2)$ examples suffice in order to learn any distribution with a margin-adapted
dimension of $k_\gamma$ using a MEM algorithm with margin $\gamma$. We further show that for every distribution
in a rich family of 'light tailed' distributions (specifically, product distributions of sub-Gaussian random
variables) the number of samples required for learning by minimizing the margin error is at least
$\Omega(k_\gamma)$.



1. This upper bound can be derived analogously to the result for ERM algorithms with $\epsilon$ being the excess classification
error. It can also be concluded from our analysis in Theorem 11 below.
Denote by $m(\epsilon,\gamma,D)$ the number of examples required to achieve an excess error of no more
than $\epsilon$ relative to the best possible $\gamma$-margin error for a specific distribution $D$, using a MEM
algorithm. Our main result shows the following matching distribution-specific upper and lower bounds
on the sample complexity of MEM:

$$\Omega(k_\gamma(D)) \le m(\epsilon,\gamma,D) \le \tilde{O}\left(\frac{k_\gamma(D)}{\epsilon^2}\right). \qquad (1)$$

Our tight characterization, and in particular the distribution-specific lower bound on the sample 
complexity that we establish, can be used to compare large-margin (L2 regularized) learning to other 
learning rules. We provide two such examples: we use our lower bound to rigorously establish a
sample complexity gap between $L_1$ and $L_2$ regularization previously studied in Ng (2004), and to
show a large gap between discriminative and generative learning on a Gaussian-mixture distribution. 
The tight bounds can also be used for active learning algorithms in which sample-complexity bounds 
are used to decide on the next label to query. 

In this paper we focus only on large margin classification. But in order to obtain the distribution- 
specific lower bound, we develop new tools that we believe can be useful for obtaining lower bounds 
also for other learning rules. We provide several new results which we use to derive our main results. 
These include: 

• Linking the fat-shattering of a sample with non-negligible probability to a difficulty of learn- 
ing using MEM. 

• Showing that for a convex hypothesis class, fat-shattering is equivalent to shattering with 
exact margins. 

• Linking the fat-shattering of a set of vectors with the eigenvalues of the Gram matrix of the 
vectors. 

• Providing a new lower bound for the smallest eigenvalue of a random Gram matrix gener- 
ated by sub-Gaussian variables. This bound extends previous results in analysis of random 
matrices. 



Some of the results in this work have appeared in a short format in Sabato et al. (2010).



Paper structure We discuss related work on sample-complexity upper bounds in Section 2. We
present the problem setting and notation in Section 3, and provide some necessary preliminaries in
Section 4. We then introduce the margin-adapted dimension in Section 5. The sample-complexity
upper bound is proved in Section 6. We prove the lower bound in Section 7. In Section 8 we
show that any non-trivial sample-complexity lower bound for more general distributions must employ
properties other than the covariance matrix of the distribution. We summarize and discuss
implications in Section 9. Proofs omitted from the text are provided in Appendix A.

2. Related work 

As mentioned above, most work on "sample complexity lower bounds" is directed at proving that 
under some set of assumptions, there exists a data distribution for which one needs at least a certain
number of examples to learn with required error and confidence (for instance Antos and Lugosi,
1998; Ehrenfeucht et al., 1988; Gentile and Helmbold, 1998). This type of lower bound does
not, however, indicate much about the sample complexity of other distributions under the same set of
assumptions. 



For distribution-specific lower bounds, the classical analysis of Vapnik (Vapnik, 1995, Theorem
16.6) provides not only sufficient but also necessary conditions for the learnability of a hypothesis
class with respect to a specific distribution. The essential condition is that the metric entropy of the
hypothesis class with respect to the distribution be sub-linear in the limit of an infinite sample size.
In some sense, this criterion can be seen as providing a "lower bound" on learnability for a specific
distribution. However, we are interested in finite-sample convergence rates, and would like those
to depend on simple properties of the distribution. The asymptotic arguments involved in Vapnik's
general learnability claim do not lend themselves easily to such analysis.



Benedek and Itai (1991) show that if the distribution is known to the learner, a specific hypothesis
class is learnable if and only if there is a finite $\epsilon$-cover of this hypothesis class with respect to
the distribution. Ben-David et al. (2008) consider a similar setting, and prove sample complexity
lower bounds for learning with any data distribution, for some binary hypothesis classes on the real
line. Vayatis and Azencott (1999) provide distribution-specific sample complexity upper bounds for
hypothesis classes with a limited VC-dimension, as a function of how balanced the hypotheses are
with respect to the considered distributions. These bounds are not tight for all distributions, thus
they also do not fully characterize the distribution-specific sample complexity.

As can be seen in Eq. (1), we do not tightly characterize the dependence of the sample complexity
on the desired error (as done e.g. in Steinwart and Scovel, 2007), thus our bounds are not
tight for asymptotically small error levels. Our results are most significant if the desired error level
is a constant well below chance but bounded away from zero. This is in contrast to classical statistical
asymptotics that are also typically tight, but are valid only for very small $\epsilon$. As was recently
shown by Liang and Srebro (2010), the sample complexity for very small $\epsilon$ (in the classical statistical
asymptotic regime) depends on quantities that can be very different from those that control the
sample complexity for moderate error rates, which are more relevant for machine learning.



3. Problem setting and definitions 

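Consider a domain $\mathcal{X}$, and let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$. We denote by $D_X$ the marginal
distribution of $D$ on $\mathcal{X}$. The misclassification error of a classifier $h : \mathcal{X} \to \mathbb{R}$ on a distribution $D$ is

$$\ell_0(h,D) \triangleq \mathbb{P}_{(X,Y)\sim D}[Y \cdot h(X) < 0].$$

The margin error of a classifier $h$ with respect to a margin $\gamma > 0$ on $D$ is

$$\ell_\gamma(h,D) \triangleq \mathbb{P}_{(X,Y)\sim D}[Y \cdot h(X) < \gamma].$$

For a given hypothesis class $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$, the best achievable margin error on $D$ is

$$\ell^*_\gamma(\mathcal{H},D) \triangleq \inf_{h \in \mathcal{H}} \ell_\gamma(h,D).$$

We usually write simply $\ell^*_\gamma(D)$ since $\mathcal{H}$ is clear from context.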

A labeled sample is a (multi-)set $S = \{(x_i,y_i)\}_{i=1}^m \subseteq \mathcal{X} \times \{\pm 1\}$. Given $S$, we denote the set
of its examples without their labels by $S_X \triangleq \{x_1,\ldots,x_m\}$. We use $S$ also to refer to the uniform
distribution over the elements in $S$. Thus the misclassification error of $h : \mathcal{X} \to \{\pm 1\}$ on $S$ is

$$\ell(h,S) \triangleq \frac{1}{m}\,|\{i \mid y_i \cdot h(x_i) < 0\}|,$$

and the $\gamma$-margin error on $S$ is

$$\ell_\gamma(h,S) \triangleq \frac{1}{m}\,|\{i \mid y_i \cdot h(x_i) < \gamma\}|.$$
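As a concrete illustration of the empirical quantities just defined, the following minimal sketch computes the misclassification error and the margin error of a linear classifier on a toy sample. The helper names (`linear_classifier`, `margin_error`) and the sample are hypothetical and chosen only for illustration; the strict inequality follows the definitions as written above.

```python
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
LabeledSample = Sequence[Tuple[Vector, int]]


def linear_classifier(w: Vector) -> Callable[[Vector], float]:
    """Return the real-valued map x -> <x, w> (hypothetical helper)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def margin_error(h: Callable[[Vector], float],
                 sample: LabeledSample,
                 gamma: float) -> float:
    """Fraction of examples (x, y) in the sample with y * h(x) < gamma.

    With gamma = 0 this is the empirical misclassification error.
    """
    violations = sum(1 for x, y in sample if y * h(x) < gamma)
    return violations / len(sample)


# Toy sample in R^2 (an assumption for illustration only).
sample = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), -1), ((0.25, 0.0), -1)]
h = linear_classifier((1.0, 0.0))

zero_margin = margin_error(h, sample, 0.0)  # only the last example is misclassified
unit_margin = margin_error(h, sample, 1.0)  # the second example also violates the margin
```

On this sample the margin error is at least the misclassification error, as it must be for any classifier, since every misclassified example also violates any positive margin.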

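A learning algorithm is a function $\mathcal{A} : \bigcup_{m=1}^{\infty} (\mathcal{X} \times \{\pm 1\})^m \to \mathbb{R}^{\mathcal{X}}$, that receives a training set as
input, and returns a function for classifying objects in $\mathcal{X}$ into real values. The high-probability loss
of an algorithm $\mathcal{A}$ with respect to samples of size $m$, a distribution $D$ and a confidence parameter
$\delta \in (0,1)$ is

$$\ell(\mathcal{A},D,m,\delta) = \inf\{\epsilon \ge 0 \mid \mathbb{P}_{S \sim D^m}[\ell(\mathcal{A}(S),D) \ge \epsilon] \le \delta\}.$$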

In this work we investigate the sample complexity of learning using margin-error minimization
(MEM). The relevant class of algorithms is defined as follows.

Definition 1 A margin-error minimization (MEM) algorithm $\mathcal{A}$ maps a margin parameter $\gamma > 0$
to a learning algorithm $\mathcal{A}_\gamma$, such that

$$\forall S \subseteq \mathcal{X} \times \{\pm 1\}, \quad \mathcal{A}_\gamma(S) \in \operatorname*{argmin}_{h \in \mathcal{H}} \ell_\gamma(h,S).$$

The distribution-specific sample complexity for MEM algorithms is the sample size required to 
guarantee low excess error for the given distribution. Formally, we have the following definition. 

Definition 2 (Distribution-specific sample complexity) Fix a hypothesis class $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$. For
$\gamma > 0$, $\epsilon, \delta \in [0,1]$, and a distribution $D$, the distribution-specific sample complexity, denoted
by $m(\epsilon,\gamma,D,\delta)$, is the minimal sample size such that for any MEM algorithm $\mathcal{A}$ and for any
$m \ge m(\epsilon,\gamma,D,\delta)$,

$$\ell_0(\mathcal{A}_\gamma,D,m,\delta) - \ell^*_\gamma(D) \le \epsilon.$$

Note that we require that all possible MEM algorithms do well on the given distribution. This is 
because we are interested in the MEM strategy in general, and thus we study the guarantees that 
can be provided regardless of any specific MEM implementation. We sometimes omit $\delta$ and write
simply $m(\epsilon,\gamma,D)$, to indicate that $\delta$ is assumed to be some fixed small constant.

In this work we focus on linear classifiers. For simplicity of notation, we assume a Euclidean
space $\mathbb{R}^d$ for some integer $d$, although the results can be easily extended to any separable Hilbert
space. For a real vector $x$, $\|x\|$ stands for the Euclidean norm. For a real matrix $\mathbf{X}$, $\|\mathbf{X}\|$ stands for
the Euclidean operator norm.

Denote the unit ball in $\mathbb{R}^d$ by $\mathbb{B}_1^d \triangleq \{w \in \mathbb{R}^d \mid \|w\| \le 1\}$. We consider the hypothesis class of
homogeneous linear separators, $\mathcal{W} = \{x \mapsto \langle x,w \rangle \mid w \in \mathbb{B}_1^d\}$. We often slightly abuse notation by
using $w$ to denote the mapping $x \mapsto \langle x,w \rangle$.

We often represent sets of vectors in $\mathbb{R}^d$ using matrices. We say that $\mathbf{X} \in \mathbb{R}^{m \times d}$ is the matrix of
a set $\{x_1,\ldots,x_m\} \subseteq \mathbb{R}^d$ if the rows in the matrix are exactly the vectors in the set. For uniqueness,
one may assume that the rows of $\mathbf{X}$ are sorted according to an arbitrary fixed full order on vectors in
$\mathbb{R}^d$. For a PSD matrix $\mathbf{X}$ denote the largest eigenvalue of $\mathbf{X}$ by $\lambda_{\max}(\mathbf{X})$ and the smallest eigenvalue
by $\lambda_{\min}(\mathbf{X})$.

We use the $O$-notation as follows: $O(f(z))$ stands for $C_1 + C_2 f(z)$ for some constants
$C_1, C_2 > 0$. $\Omega(f(z))$ stands for $C_2 f(z) - C_1$ for some constants $C_1, C_2 > 0$. $\tilde{O}(f(z))$ stands
for $f(z)\,p(\ln(z)) + C$ for some polynomial $p(\cdot)$ and some constant $C > 0$.
4. Preliminaries

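As mentioned above, for the hypothesis class of linear classifiers $\mathcal{W}$, one can derive a sample-complexity
upper bound of the form $O(B^2/\gamma^2\epsilon^2)$, where $B^2 = \mathbb{E}_{X \sim D_X}[\|X\|^2]$ and $\epsilon$ is the excess
error relative to the $\gamma$-margin loss. This can be achieved as follows (Bartlett and Mendelson, 2002).
Let $\mathcal{Z}$ be some domain. The empirical Rademacher complexity of a class of functions $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{Z}}$
with respect to a set $S = \{z_i\}_{i \in [m]} \subseteq \mathcal{Z}$ is

$$\mathcal{R}(\mathcal{F},S) = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\Big|\sup_{f \in \mathcal{F}} \sum_{i \in [m]} \sigma_i f(z_i)\Big|\Big],$$

where $\sigma = (\sigma_1,\ldots,\sigma_m)$ are $m$ independent uniform $\{\pm 1\}$-valued variables. The average
Rademacher complexity of $\mathcal{F}$ with respect to a distribution $D$ over $\mathcal{Z}$ and a sample size $m$ is

$$\mathcal{R}_m(\mathcal{F},D) = \mathbb{E}_{S \sim D^m}[\mathcal{R}(\mathcal{F},S)].$$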
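For the class of unit-norm linear functions used in this paper, the empirical Rademacher complexity can be evaluated exactly on a small sample: for a fixed sign vector, the supremum over the unit ball equals the Euclidean norm of the signed sum of the points (Cauchy-Schwarz). The sketch below enumerates all $2^m$ sign vectors. The function name and the toy points are assumptions made for illustration, and the enumeration is only practical for small $m$.

```python
import itertools
import math
from typing import Sequence

Vector = Sequence[float]


def rademacher_linear_unit_ball(points: Sequence[Vector]) -> float:
    """Exact empirical Rademacher complexity of {x -> <x, w> : ||w|| <= 1}.

    For each sign vector sigma, |sup_w sum_i sigma_i <w, x_i>| equals
    || sum_i sigma_i x_i ||, so the expectation over sigma is an average
    of 2^m Euclidean norms, divided by m.
    """
    m = len(points)
    dim = len(points[0])
    total = 0.0
    for sigma in itertools.product((-1, 1), repeat=m):
        signed_sum = [sum(s * x[j] for s, x in zip(sigma, points)) for j in range(dim)]
        total += math.sqrt(sum(c * c for c in signed_sum))
    return total / (2 ** m) / m


# Toy points in R^2 (an assumption for illustration only).
points = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
complexity = rademacher_linear_unit_ball(points)

# Jensen's inequality gives E||sum_i sigma_i x_i|| <= sqrt(sum_i ||x_i||^2).
jensen_bound = math.sqrt(sum(sum(c * c for c in x) for x in points)) / len(points)
```

The Jensen bound computed at the end is the sample analogue of the norm-based bound on the Rademacher complexity that this section goes on to use.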
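Assume a hypothesis class $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$ and a loss function $\ell : \{\pm 1\} \times \mathbb{R} \to \mathbb{R}$. For a hypothesis
$h \in \mathcal{H}$, we introduce the function $h_\ell : \mathcal{X} \times \{\pm 1\} \to \mathbb{R}$, defined by $h_\ell(x,y) = \ell(y,h(x))$. We
further define the function class $\mathcal{H}_\ell = \{h_\ell \mid h \in \mathcal{H}\} \subseteq \mathbb{R}^{\mathcal{X} \times \{\pm 1\}}$.

Assume that the range of $\mathcal{H}_\ell$ is in $[0,1]$. For any $\delta \in (0,1)$, with probability of $1-\delta$
over the draw of samples $S \subseteq \mathcal{X} \times \{\pm 1\}$ of size $m$ according to $D$, every $h \in \mathcal{H}$ satisfies
(Bartlett and Mendelson, 2002)

$$\ell(h,D) \le \ell(h,S) + 2\mathcal{R}_m(\mathcal{H}_\ell,D) + \sqrt{\frac{8\ln(2/\delta)}{m}}. \qquad (2)$$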

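To get the desired upper bound for linear classifiers we use the ramp loss, which is defined as
follows. For a number $r$, denote $\llbracket r \rrbracket \triangleq \min(\max(r,0),1)$. The $\gamma$-ramp-loss of a labeled example
$(x,y) \in \mathbb{R}^d \times \{\pm 1\}$ with respect to a linear classifier $w \in \mathbb{B}_1^d$ is $\mathrm{ramp}_\gamma(w,x,y) = \llbracket 1 - y\langle w,x \rangle/\gamma \rrbracket$.
Let $\mathrm{ramp}_\gamma(w,D) = \mathbb{E}_{(X,Y)\sim D}[\mathrm{ramp}_\gamma(w,X,Y)]$, and denote the class of ramp-loss functions by

$$\mathrm{RAMP}_\gamma \triangleq \{(x,y) \mapsto \mathrm{ramp}_\gamma(w,x,y) \mid w \in \mathbb{B}_1^d\}.$$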

The ramp-loss is upper-bounded by the margin loss and lower-bounded by the misclassification 
error. Therefore, the following result can be shown. 
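The sandwich property stated above (misclassification error at most ramp loss, ramp loss at most margin error) can be checked numerically. The following is a minimal sketch under the assumption that the errors use strict inequalities as in Section 3; the helper names `clip01` and `ramp_loss` and the toy sample are hypothetical.

```python
from typing import Sequence


def clip01(r: float) -> float:
    """The bracket operation: min(max(r, 0), 1)."""
    return min(max(r, 0.0), 1.0)


def ramp_loss(w: Sequence[float], x: Sequence[float], y: int, gamma: float) -> float:
    """gamma-ramp-loss of the labeled example (x, y) for the linear classifier w."""
    inner = sum(wi * xi for wi, xi in zip(w, x))
    return clip01(1.0 - y * inner / gamma)


# Toy setup (an assumption for illustration only).
w = (1.0, 0.0)
gamma = 1.0
sample = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), -1), ((0.25, 0.0), -1)]

ramp_values = [ramp_loss(w, x, y, gamma) for x, y in sample]
scores = [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in sample]

misclassification = sum(s < 0 for s in scores) / len(sample)
margin_error = sum(s < gamma for s in scores) / len(sample)
mean_ramp = sum(ramp_values) / len(sample)
```

Here the average ramp loss lies strictly between the two 0/1-type errors because one example is classified correctly but with a margin smaller than $\gamma$, where the ramp loss interpolates linearly.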

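Proposition 3 For any MEM algorithm $\mathcal{A}$, we have

$$\ell_0(\mathcal{A}_\gamma,D,m,\delta) \le \ell^*_\gamma(\mathcal{W},D) + 2\mathcal{R}_m(\mathrm{RAMP}_\gamma,D) + \sqrt{\frac{14\ln(2/\delta)}{m}}. \qquad (3)$$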

We give the proof in Appendix A.1 for completeness. Since the $\gamma$-ramp loss is $1/\gamma$ Lipschitz, it
follows from Bartlett and Mendelson (2002) that

$$\mathcal{R}_m(\mathrm{RAMP}_\gamma,D) \le \sqrt{\frac{B^2}{\gamma^2 m}}.$$

Combining this with Proposition 3 we can conclude a sample complexity upper bound of
$O(B^2/\gamma^2\epsilon^2)$.

In addition to the Rademacher complexity, we will also use the classic notions of fat-shattering
(Kearns and Schapire, 1994) and pseudo-shattering (Pollard, 1984), defined as follows.
Definition 4 Let $\mathcal{F}$ be a set of functions $f : \mathcal{X} \to \mathbb{R}$, and let $\gamma > 0$. The set $\{x_1,\ldots,x_m\} \subseteq \mathcal{X}$
is $\gamma$-shattered by $\mathcal{F}$ with the witness $r \in \mathbb{R}^m$ if for all $y \in \{\pm 1\}^m$ there is an $f \in \mathcal{F}$ such that
$\forall i \in [m]$, $y[i](f(x_i) - r[i]) \ge \gamma$.

The $\gamma$-shattering dimension of a hypothesis class is the size of the largest set that is $\gamma$-shattered by
this class. We say that a set is $\gamma$-shattered at the origin if it is $\gamma$-shattered with the zero vector as a
witness.

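Definition 5 Let $\mathcal{F}$ be a set of functions $f : \mathcal{X} \to \mathbb{R}$, and let $\gamma > 0$. The set $\{x_1,\ldots,x_m\} \subseteq \mathcal{X}$ is
pseudo-shattered by $\mathcal{F}$ with the witness $r \in \mathbb{R}^m$ if for all $y \in \{\pm 1\}^m$ there is an $f \in \mathcal{F}$ such that
$\forall i \in [m]$, $y[i](f(x_i) - r[i]) > 0$.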

The pseudo-dimension of a hypothesis class is the size of the largest set that is pseudo-shattered by 
this class. 
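For a finite function class, both shattering notions can be checked by brute force over all $2^m$ labelings. The sketch below does this, reading the inequality in Definition 4 as non-strict (the usual convention for fat-shattering, stated here as an assumption) and the one in Definition 5 as strict. Representing each function by its tuple of values on the points, as well as the function names, are choices made for this illustration only.

```python
import itertools
from typing import Sequence


def is_gamma_shattered(values: Sequence[Sequence[float]],
                       witness: Sequence[float],
                       gamma: float) -> bool:
    """Check gamma-shattering for a finite class given by its values on the points.

    values[j][i] is f_j(x_i) for the j-th function of the class.
    """
    m = len(witness)
    for y in itertools.product((-1, 1), repeat=m):
        realized = any(
            all(y[i] * (f[i] - witness[i]) >= gamma for i in range(m))
            for f in values
        )
        if not realized:
            return False
    return True


def is_pseudo_shattered(values: Sequence[Sequence[float]],
                        witness: Sequence[float]) -> bool:
    """Check pseudo-shattering: every labeling is realized strictly around the witness."""
    m = len(witness)
    return all(
        any(all(y[i] * (f[i] - witness[i]) > 0 for i in range(m)) for f in values)
        for y in itertools.product((-1, 1), repeat=m)
    )


# Four functions realizing every sign pattern of magnitude 1 on two points.
sign_class = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
origin = (0.0, 0.0)

shattered_at_margin_1 = is_gamma_shattered(sign_class, origin, 1.0)
shattered_at_margin_2 = is_gamma_shattered(sign_class, origin, 2.0)
pseudo_shattered = is_pseudo_shattered(sign_class, origin)
missing_pattern = is_gamma_shattered(sign_class[:3], origin, 1.0)
```

The example shows that a set can be pseudo-shattered without being $\gamma$-shattered for a large $\gamma$, and that removing a single sign pattern from the class destroys shattering.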

5. The margin-adapted dimension 

When considering learning of linear classifiers using MEM, the dimension-based upper bound and 
the norm-based upper bound are both tight in the worst-case sense, that is, they are the best bounds 
that rely only on the dimensionality or only on the norm respectively. Nonetheless, neither is tight in 
a distribution-specific sense: If the average norm is unbounded while the dimension is small, then 
there can be an arbitrarily large gap between the true distribution-dependent sample complexity 
and the bound that depends on the average norm. If the converse holds, that is, the dimension is 
arbitrarily large while the average-norm is bounded, then the dimensionality bound is loose. 

Seeking a tight distribution-specific analysis, one simple approach to tighten these bounds is to 
consider their minimum, which is proportional to $\min(d, B^2/\gamma^2)$. Trivially, this is an upper bound
on the sample complexity as well. However, this simple combination is also not tight: Consider a 
distribution in which there are a few directions with very high variance, but the combined variance 
in all other directions is small (see Figure 1). We will show that in such situations the sample complexity
is characterized not by the minimum of dimension and norm, but by the sum of the number
of high-variance dimensions and the average squared norm in the other directions. This behavior is 
captured by the margin-adapted dimension which we presently define, using the following auxiliary 
definition. 

Definition 6 Let $b > 0$ and let $k$ be a positive integer. A distribution $D_X$ over $\mathbb{R}^d$ is $(b,k)$-limited
if there exists a sub-space $V \subseteq \mathbb{R}^d$ of dimension $d-k$ such that $\mathbb{E}_{X \sim D_X}[\|\mathbf{O}_V \cdot X\|^2] \le b$, where
$\mathbf{O}_V$ is an orthogonal projection onto $V$.




Figure 1: Illustrating covariance matrix ellipsoids, left: norm bound is tight; middle: dimension 
bound is tight; right: neither bound is tight. 
Definition 7 (margin-adapted dimension) The margin-adapted dimension of a distribution $D_X$,
denoted by $k_\gamma(D_X)$, is the minimum $k$ such that the distribution is $(\gamma^2 k, k)$-limited.

We sometimes drop the argument of $k_\gamma$ when it is clear from context. It is easy to see that for any
distribution $D_X$ over $\mathbb{R}^d$, $k_\gamma(D_X) \le \min(d, \mathbb{E}[\|X\|^2]/\gamma^2)$. Moreover, $k_\gamma$ can be much smaller than
this minimum. For example, consider a random vector $X \in \mathbb{R}^{1001}$ with mean zero and statistically
independent coordinates, such that the variance of the first coordinate is 1000, and the variance in
each remaining coordinate is 0.001. We have $k_1 = 1$ but $d = \mathbb{E}[\|X\|^2] = 1001$.

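$k_\gamma(D_X)$ can be calculated from the uncentered covariance matrix $\mathbb{E}_{X \sim D_X}[XX^T]$ as follows:
Let $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d \ge 0$ be the eigenvalues of this matrix. Then

$$k_\gamma = \min\Big\{k \;\Big|\; \sum_{i=k+1}^{d} \lambda_i \le \gamma^2 k\Big\}. \qquad (4)$$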



A quantity similar to this definition of $k_\gamma$ was studied previously in Bousquet (2002). The eigenvalues
of the empirical covariance matrix were used to provide sample complexity bounds, for instance
in Schölkopf et al. (1999). However, $k_\gamma$ generates a different type of bound, since it is defined based
on the eigenvalues of the distribution and not of the sample. We will see that for small finite samples,
the latter can be quite different from the former.

Finally, note that while we define the margin-adapted dimension for a finite-dimensional space 
for ease of notation, the same definition carries over to an infinite-dimensional Hilbert space. Moreover,
$k_\gamma$ can be finite even if some of the eigenvalues $\lambda_i$ are infinite, implying a distribution with
unbounded covariance. 



6. A Distribution-Dependent Upper Bound 

In this section we prove an upper bound on the sample complexity of learning with MEM, using 
the margin-adapted dimension. We do this by providing a tighter upper bound for the Rademacher 
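complexity of $\mathrm{RAMP}_\gamma$. We bound $\mathcal{R}_m(\mathrm{RAMP}_\gamma, D)$ for any $(B^2,k)$-limited distribution $D_X$, using
$L_2$ covering numbers, defined as follows.

Let $(\mathcal{X}, \|\cdot\|_\circ)$ be a normed space. An $\eta$-covering of a set $\mathcal{F} \subseteq \mathcal{X}$ with respect to the norm
$\|\cdot\|_\circ$ is a set $\mathcal{C} \subseteq \mathcal{X}$ such that for any $f \in \mathcal{F}$ there exists a $g \in \mathcal{C}$ such that $\|f - g\|_\circ \le \eta$.
The covering-number for given $\eta > 0$, $\mathcal{F}$ and $\circ$ is the size of the smallest such $\eta$-covering, and is
denoted by $\mathcal{N}(\eta,\mathcal{F},\circ)$. Let $S = \{x_1,\ldots,x_m\} \subseteq \mathbb{R}^d$. For a function $f : \mathbb{R}^d \to \mathbb{R}$, the $L_2(S)$
norm of $f$ is $\|f\|_{L_2(S)} = \sqrt{\mathbb{E}_{X \sim S}[f(X)^2]}$. Thus, we consider covering-numbers of the form
$\mathcal{N}(\eta,\mathrm{RAMP}_\gamma,L_2(S))$.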

The empirical Rademacher complexity of a function class can be bounded by the $L_2$ covering
numbers of the same function class as follows (Mendelson, 2002, Lemma 3.7): Let $\epsilon_i = 2^{-i}$. Then

$$\sqrt{m}\,\mathcal{R}(\mathrm{RAMP}_\gamma,S) \le C \sum_{i \in [N]} \epsilon_{i-1}\sqrt{\ln \mathcal{N}(\epsilon_i,\mathrm{RAMP}_\gamma,L_2(S))} + 2\epsilon_N\sqrt{m}. \qquad (5)$$

To bound the covering number of $\mathrm{RAMP}_\gamma$, we will restate the functions in $\mathrm{RAMP}_\gamma$ as sums of two
functions, each selected from a function class with bounded complexity. The first function class will
be bounded because of the norm bound on the subspace $V$ used in Def. 6, and the second function
class will have a bounded pseudo-dimension. However, the second function class will depend on the
choice of the first function in the sum. Therefore, we require the following lemma, which provides
an upper bound on such sums of functions. We use the notion of a Hausdorff distance between two
sets $\mathcal{G}_1, \mathcal{G}_2 \subseteq \mathcal{X}$, defined as $\Delta_H(\mathcal{G}_1,\mathcal{G}_2) = \sup_{g_1 \in \mathcal{G}_1} \inf_{g_2 \in \mathcal{G}_2} \|g_1 - g_2\|_\circ$.

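Lemma 8 Let $(\mathcal{X}, \|\cdot\|_\circ)$ be a normed space. Let $\mathcal{F} \subseteq \mathcal{X}$ be a set, and let $\mathcal{G} : \mathcal{X} \to 2^{\mathcal{X}}$ be a
mapping from objects in $\mathcal{X}$ to sets of objects in $\mathcal{X}$. Assume that $\mathcal{G}$ is $c$-Lipschitz with respect to the
Hausdorff distance on sets, that is assume that

$$\forall f_1, f_2 \in \mathcal{X}, \quad \Delta_H(\mathcal{G}(f_1),\mathcal{G}(f_2)) \le c\|f_1 - f_2\|_\circ.$$

Let $\mathcal{F}_{\mathcal{G}} = \{f + g \mid f \in \mathcal{F}, g \in \mathcal{G}(f)\}$. Then

$$\mathcal{N}(\eta,\mathcal{F}_{\mathcal{G}},\circ) \le \mathcal{N}(\eta/(2+c),\mathcal{F},\circ) \cdot \sup_{f \in \mathcal{F}} \mathcal{N}(\eta/(2+c),\mathcal{G}(f),\circ).$$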

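Proof For any set $A \subseteq \mathcal{X}$, denote by $\mathcal{C}_A$ a minimal $\eta$-covering for $A$ with respect to $\|\cdot\|_\circ$, so
that $|\mathcal{C}_A| = \mathcal{N}(\eta,A,\circ)$. Let $f + g \in \mathcal{F}_{\mathcal{G}}$ such that $f \in \mathcal{F}$, $g \in \mathcal{G}(f)$. There is an $\hat{f} \in \mathcal{C}_{\mathcal{F}}$
such that $\|f - \hat{f}\|_\circ \le \eta$. In addition, by the Lipschitz assumption there is a $\tilde{g} \in \mathcal{G}(\hat{f})$ such that
$\|g - \tilde{g}\|_\circ \le c\|f - \hat{f}\|_\circ \le c\eta$. Lastly, there is a $\hat{g} \in \mathcal{C}_{\mathcal{G}(\hat{f})}$ such that $\|\tilde{g} - \hat{g}\|_\circ \le \eta$. Therefore

$$\|f + g - (\hat{f} + \hat{g})\|_\circ \le \|f - \hat{f}\|_\circ + \|g - \tilde{g}\|_\circ + \|\tilde{g} - \hat{g}\|_\circ \le (2+c)\eta.$$

Thus the set $\{f + g \mid f \in \mathcal{C}_{\mathcal{F}}, g \in \mathcal{C}_{\mathcal{G}(f)}\}$ is a $(2+c)\eta$ cover of $\mathcal{F}_{\mathcal{G}}$. The size of this cover is at
most $|\mathcal{C}_{\mathcal{F}}| \cdot \sup_{f \in \mathcal{F}} |\mathcal{C}_{\mathcal{G}(f)}| \le \mathcal{N}(\eta,\mathcal{F},\circ) \cdot \sup_{f \in \mathcal{F}} \mathcal{N}(\eta,\mathcal{G}(f),\circ)$. ■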



The following lemma provides us with a useful class of mappings which are 1-Lipschitz with 
respect to the Hausdorff distance, as required in Lemma 8. The proof is provided in Appendix A.2.

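Lemma 9 Let $f : \mathcal{X} \to \mathbb{R}$ be a function and let $\mathcal{Z} \subseteq \mathbb{R}^{\mathcal{X}}$ be a function class over some domain $\mathcal{X}$.
Let $\mathcal{G} : \mathbb{R}^{\mathcal{X}} \to 2^{\mathbb{R}^{\mathcal{X}}}$ be the mapping defined by

$$\mathcal{G}(f) \triangleq \{x \mapsto \llbracket f(x) + z(x) \rrbracket - f(x) \mid z \in \mathcal{Z}\}. \qquad (6)$$

Then $\mathcal{G}$ is 1-Lipschitz with respect to the Hausdorff distance.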

The function class induced by the mapping above preserves the pseudo-dimension of the original 
function class, as the following lemma shows. The proof is provided in Appendix A.3.

Lemma 10 Let $f : \mathcal{X} \to \mathbb{R}$ be a function and let $\mathcal{Z} \subseteq \mathbb{R}^{\mathcal{X}}$ be a function class over some domain
$\mathcal{X}$. Let $\mathcal{G}(f)$ be defined as in Eq. (6). Then the pseudo-dimension of $\mathcal{G}(f)$ is at most the
pseudo-dimension of $\mathcal{Z}$.

Equipped with these lemmas, we can now provide the new bound on the Rademacher complexity
of $\mathrm{RAMP}_\gamma$ in the following theorem. The subsequent corollary states the resulting sample-complexity
upper bound for MEM, which depends on $k_\gamma$.

Theorem 11 Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$, and assume $D_X$ is $(B^2,k)$-limited. Then

$$\mathcal{R}_m(\mathrm{RAMP}_\gamma,D) \le \sqrt{\frac{O(k + B^2/\gamma^2)\ln(m)}{m}}.$$
V m 
Proof In this proof all absolute constants are assumed to be positive and are denoted by $C$ or $C_i$ for
some integer $i$. Their values may change from line to line or even within the same line.

Consider the distribution $\tilde{D}$ which results from drawing $(X,Y) \sim D$ and emitting $(Y \cdot X, 1)$.
It too is $(B^2,k)$-limited, and $\mathcal{R}_m(\mathrm{RAMP}_\gamma,\tilde{D}) = \mathcal{R}_m(\mathrm{RAMP}_\gamma,D)$. Therefore, we assume without loss
of generality that for all $(X,Y)$ drawn from $D$, $Y = 1$. Accordingly, we henceforth omit the $y$
argument from $\mathrm{ramp}_\gamma(w,x,y)$ and write simply $\mathrm{ramp}_\gamma(w,x) \triangleq \mathrm{ramp}_\gamma(w,x,1)$.

Following Def. 6, let $\mathbf{O}_V$ be an orthogonal projection onto a sub-space $V$ of dimension $d-k$
such that $\mathbb{E}_{X \sim D_X}[\|\mathbf{O}_V \cdot X\|^2] \le B^2$. Let $\bar{V}$ be the complementary sub-space to $V$. For a set
$S = \{x_1,\ldots,x_m\} \subseteq \mathbb{R}^d$, denote $B(S) = \sqrt{\frac{1}{m}\sum_{i \in [m]} \|\mathbf{O}_V \cdot x_i\|^2}$.

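We would like to use Eq. (5) to bound the Rademacher complexity of $\mathrm{RAMP}_\gamma$. Therefore, we
will bound $\mathcal{N}(\eta,\mathrm{RAMP}_\gamma,L_2(S))$ for $\eta > 0$. Note that

$$\mathrm{ramp}_\gamma(w,x) = \llbracket 1 - \langle w,x \rangle/\gamma \rrbracket = 1 - \llbracket \langle w,x \rangle/\gamma \rrbracket.$$

Shifting by a constant and negating do not change the covering number of a function class. Therefore,
$\mathcal{N}(\eta,\mathrm{RAMP}_\gamma,L_2(S))$ is equal to the covering number of $\{x \mapsto \llbracket \langle w,x \rangle/\gamma \rrbracket \mid w \in \mathbb{B}_1^d\}$.
Moreover, let

$$\mathrm{RAMP}'_\gamma = \{x \mapsto \llbracket \langle w_a + w_b, x \rangle/\gamma \rrbracket \mid w_a \in \mathbb{B}_1^d \cap V,\; w_b \in \bar{V}\}.$$

Then $\{x \mapsto \llbracket \langle w,x \rangle/\gamma \rrbracket \mid w \in \mathbb{B}_1^d\} \subseteq \mathrm{RAMP}'_\gamma$, thus it suffices to bound $\mathcal{N}(\eta,\mathrm{RAMP}'_\gamma,L_2(S))$. To
do that, we show that $\mathrm{RAMP}'_\gamma$ satisfies the assumptions of Lemma 8 for the normed space $(\mathbb{R}^{\mathbb{R}^d}, \|\cdot\|_{L_2(S)})$.
Define

$$\mathcal{F} = \{x \mapsto \langle w_a, x \rangle/\gamma \mid w_a \in \mathbb{B}_1^d \cap V\}.$$

Let $\mathcal{G} : \mathbb{R}^{\mathbb{R}^d} \to 2^{\mathbb{R}^{\mathbb{R}^d}}$ be the mapping defined by

$$\mathcal{G}(f) \triangleq \{x \mapsto \llbracket f(x) + \langle w_b,x \rangle/\gamma \rrbracket - f(x) \mid w_b \in \bar{V}\}.$$

Clearly, $\mathcal{F}_{\mathcal{G}} = \{f + g \mid f \in \mathcal{F}, g \in \mathcal{G}(f)\} = \mathrm{RAMP}'_\gamma$. Furthermore, by Lemma 9, $\mathcal{G}$ is 1-Lipschitz
with respect to the Hausdorff distance. Thus, by Lemma 8,

$$\mathcal{N}(\eta,\mathrm{RAMP}'_\gamma,L_2(S)) \le \mathcal{N}(\eta/3,\mathcal{F},L_2(S)) \cdot \sup_{f \in \mathcal{F}} \mathcal{N}(\eta/3,\mathcal{G}(f),L_2(S)). \qquad (7)$$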

We now proceed to bound the two covering numbers on the right hand side. First, consider
$\mathcal{N}(\eta/3,\mathcal{G}(f),L_2(S))$. By Lemma 10, the pseudo-dimension of $\mathcal{G}(f)$ is at most the
pseudo-dimension of $\{x \mapsto \langle w,x \rangle/\gamma \mid w \in \bar{V}\}$, which is exactly $k$, the dimension of $\bar{V}$. The $L_2$ covering
number of $\mathcal{G}(f)$ can be bounded by the pseudo-dimension of $\mathcal{G}(f)$ as follows (see e.g. Bartlett,
2006, Theorem 3.1):

$$\mathcal{N}(\eta/3,\mathcal{G}(f),L_2(S)) \le C_1\Big(\frac{C_2}{\eta^2}\Big)^{k}. \qquad (8)$$



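Second, consider $\mathcal{N}(\eta/3,\mathcal{F},L_2(S))$. Sudakov's minoration theorem (Sudakov, 1971, and see also
Ledoux and Talagrand, 1991, Theorem 3.18) states that for any $\eta > 0$

$$\ln \mathcal{N}(\eta,\mathcal{F},L_2(S)) \le \frac{C}{m\eta^2}\Big(\mathbb{E}_s\Big[\sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^{m} s_i f(x_i)\Big|\Big]\Big)^2,$$

where $s = (s_1,\ldots,s_m)$ are independent standard normal variables. The right-hand side can be
bounded as follows:

$$\gamma\,\mathbb{E}_s\Big[\sup_{f \in \mathcal{F}} \Big|\sum_{i=1}^{m} s_i f(x_i)\Big|\Big] = \mathbb{E}_s\Big[\sup_{w \in \mathbb{B}_1^d \cap V} \Big|\Big\langle w, \sum_{i=1}^{m} s_i x_i \Big\rangle\Big|\Big] \le \mathbb{E}_s\Big[\Big\|\sum_{i=1}^{m} s_i \mathbf{O}_V x_i\Big\|\Big] \le \sqrt{\sum_{i \in [m]} \|\mathbf{O}_V x_i\|^2} = \sqrt{m}\,B(S).$$

Therefore $\ln \mathcal{N}(\eta,\mathcal{F},L_2(S)) \le C\,\frac{B^2(S)}{\gamma^2\eta^2}$. Substituting this and Eq. (8) for the right-hand side in
Eq. (7), and adjusting constants, we get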
Eq. dT), and adjusting constants, we get 

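$$\ln \mathcal{N}(\eta,\mathrm{RAMP}_\gamma,L_2(S)) \le \ln \mathcal{N}(\eta,\mathrm{RAMP}'_\gamma,L_2(S)) \le C_1\Big(1 + k\ln\Big(\frac{C_2}{\eta}\Big) + \frac{B^2(S)}{\gamma^2\eta^2}\Big).$$

To finalize the proof, we plug this inequality into Eq. (5) to get

$$\sqrt{m}\,\mathcal{R}(\mathrm{RAMP}_\gamma,S) \le C_1 \sum_{i\in[N]} \epsilon_{i-1}\sqrt{1 + k\ln(C_2/\epsilon_i) + \frac{B^2(S)}{\gamma^2\epsilon_i^2}} + 2\epsilon_N\sqrt{m}$$

$$\le C_1 \sum_{i\in[N]} \epsilon_{i-1}\Big(1 + \sqrt{k\ln(C_2/\epsilon_i)} + \frac{B(S)}{\gamma\epsilon_i}\Big) + 2\epsilon_N\sqrt{m}$$

$$\le C\Big(1 + \sqrt{k} + \frac{B(S)N}{\gamma}\Big) + 2^{-N+1}\sqrt{m}.$$

In the last inequality we used the fact that $\sum_i i\,2^{-i+1} \le 4$. Setting $N = \ln(2m)$ we get

$$\mathcal{R}(\mathrm{RAMP}_\gamma,S) \le \frac{C}{\sqrt{m}}\Big(1 + \sqrt{k} + \frac{B(S)\ln(2m)}{\gamma}\Big).$$

Taking expectation over both sides, and noting that $\mathbb{E}[B(S)] \le \sqrt{\mathbb{E}[B^2(S)]} \le B$, we get

$$\mathcal{R}_m(\mathrm{RAMP}_\gamma,D) \le \frac{C}{\sqrt{m}}\Big(1 + \sqrt{k} + \frac{B\ln(2m)}{\gamma}\Big) \le \sqrt{\frac{O(k + B^2\ln^2(2m)/\gamma^2)}{m}}.$$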



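Corollary 12 (Sample complexity upper bound) Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. Then

$$m(\epsilon,\gamma,D) \le \tilde{O}\left(\frac{k_\gamma(D_X)}{\epsilon^2}\right).$$

Proof By Proposition 3, we have

$$\ell_0(\mathcal{A}_\gamma,D,m,\delta) \le \ell^*_\gamma(\mathcal{W},D) + 2\mathcal{R}_m(\mathrm{RAMP}_\gamma,D) + \sqrt{\frac{14\ln(2/\delta)}{m}}.$$

By definition of $k_\gamma(D_X)$, $D_X$ is $(\gamma^2 k_\gamma, k_\gamma)$-limited. Therefore, by Theorem 11,

$$\mathcal{R}_m(\mathrm{RAMP}_\gamma,D) \le \sqrt{\frac{O(k_\gamma(D_X))\ln(m)}{m}}.$$

We conclude that

$$\ell_0(\mathcal{A}_\gamma,D,m,\delta) \le \ell^*_\gamma(\mathcal{W},D) + \sqrt{\frac{O(k_\gamma(D_X)\ln(m) + \ln(1/\delta))}{m}}.$$

Bounding the second right-hand term by $\epsilon$, we conclude that $m(\epsilon,\gamma,D) \le \tilde{O}(k_\gamma/\epsilon^2)$. ■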

One should note that a similar upper bound can be obtained much more easily under a uniform 
upper bound on the eigenvalues of the uncentered covariance matrix (see footnote 2). However, such an upper
bound would not capture the fact that a finite dimension implies a finite sample complexity, re- 
gardless of the size of the covariance. If one wants to estimate the sample complexity, then large 
covariance matrix eigenvalues imply that more examples are required to estimate the covariance 
matrix from a sample. However, these examples need not be labeled. Moreover, estimating the 
covariance matrix is not necessary to achieve the sample complexity, since the upper bound holds 
for any margin-error minimization algorithm. 

7. A Distribution-Dependent Lower Bound 

The new upper bound presented in Cor. 12 can be tighter than both the norm-only and the dimension-only
upper bounds. But does the margin-adapted dimension characterize the true sample complexity
of the distribution, or is it just another upper bound? To answer this question, we first need tools
for deriving sample complexity lower bounds. Section 7.1 relates fat-shattering with a lower bound
on sample complexity. In Section 7.2 we use this result to relate the smallest eigenvalue of a Gram
matrix to a lower bound on sample complexity. In Section 7.3 the family of sub-Gaussian product
distributions is presented. We prove a sample-complexity lower bound for this family in Section 7.4.

7.1 A sample complexity lower bound based on fat-shattering 

The ability to learn is closely related to the probability of a sample to be shattered, as evident in
Vapnik's formulations of learnability as a function of the $\epsilon$-entropy (Vapnik, 1995). It is well
known that the maximal size of a shattered set dictates a sample-complexity upper bound. In the theorem
below, we show that for some hypothesis classes it also implies a lower bound. The theorem states
that if a sample drawn from a data distribution is fat-shattered with a non-negligible probability,
then MEM can fail to learn a good classifier for this distribution. This holds not only for linear
classifiers, but more generally for all symmetric hypothesis classes. Given a domain $\mathcal{X}$, we
say that a hypothesis class $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$ is symmetric if for all
$h \in \mathcal{H}$ we have $-h \in \mathcal{H}$ as well. This clearly holds for the class of linear
classifiers $\mathcal{W}$.

Footnote (to the remark in Section 6 on a uniform upper bound on the eigenvalues): This has been
pointed out to us by an anonymous reviewer of this manuscript. An upper bound under sub-Gaussianity
assumptions can be found in Sabato et al. (2010).

Footnote (to the statement that MEM can fail to learn): In contrast, the average Rademacher complexity
cannot be used to derive general lower bounds for MEM algorithms, since it is related to the rate of
uniform convergence of the entire hypothesis class, while MEM algorithms choose low-error hypotheses
(see e.g. Bartlett et al., 2005).

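Theorem 13 Let $\mathcal{X}$ be some domain, and assume that $\mathcal{H} \subseteq \mathbb{R}^{\mathcal{X}}$
is a symmetric hypothesis class. Let $D$ be a distribution over $\mathcal{X} \times \{\pm 1\}$. If the
probability of a sample of size $m$ drawn from $D_X^m$ to be $\gamma$-shattered at the origin by
$\mathcal{H}$ is at least $\eta$, then $m(\epsilon, \gamma, D, \eta/2) \ge \lfloor m/2 \rfloor$ for all
$\epsilon < 1/2 - \ell^*_\gamma(D)$.

Proof Let $\epsilon < \frac{1}{2} - \ell^*_\gamma(D)$. We show a MEM algorithm $\mathcal{A}$ such that

\[ \ell_0(\mathcal{A}, D, \lfloor m/2 \rfloor, \eta/2) \ge \tfrac{1}{2} > \ell^*_\gamma(D) + \epsilon, \]

thus proving the desired lower bound on $m(\epsilon, \gamma, D, \eta/2)$.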

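Assume for simplicity that $m$ is even (otherwise replace $m$ with $m - 1$). Consider two sets
$S, \tilde{S} \subseteq \mathcal{X} \times \{\pm 1\}$, each of size $m/2$, such that $S_X \cup \tilde{S}_X$
is $\gamma$-shattered at the origin by $\mathcal{H}$. Then there exists a hypothesis $h_1 \in \mathcal{H}$
such that the following holds:

• For all $x \in S_X \cup \tilde{S}_X$, $|h_1(x)| \ge \gamma$.

• For all $(x, y) \in S$, $\mathrm{sign}(h_1(x)) = y$.

• For all $(x, y) \in \tilde{S}$, $\mathrm{sign}(h_1(x)) = -y$.

It follows that $\ell_\gamma(h_1, S) = 0$. In addition, let $h_2 = -h_1$. Then $\ell_\gamma(h_2, \tilde{S}) = 0$.
Moreover, we have $h_2 \in \mathcal{H}$ due to the symmetry of $\mathcal{H}$. On each point in $\mathcal{X}$,
at least one of $h_1$ and $h_2$ predicts the wrong sign. Thus $\ell_0(h_1, D) + \ell_0(h_2, D) \ge 1$. It
follows that for at least one of $i \in \{1, 2\}$, we have $\ell_0(h_i, D) \ge \frac{1}{2}$. Denote the set
of hypotheses with a high misclassification error by

\[ \mathcal{H}^{\otimes} = \{h \in \mathcal{H} \mid \ell_0(h, D) \ge \tfrac{1}{2}\}. \]

We have just shown that if $S_X \cup \tilde{S}_X$ is $\gamma$-shattered by $\mathcal{H}$ then at least one
of the following holds:
(1) $h_1 \in \mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S)$ or
(2) $h_2 \in \mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, \tilde{S})$.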

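Now, consider a MEM algorithm $\mathcal{A}$ such that whenever possible, it returns a hypothesis from
$\mathcal{H}^{\otimes}$. Formally, given the input sample $S$, if
$\mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S) \ne \emptyset$, then
$\mathcal{A}(S) \in \mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S)$.
It follows that

\begin{align*}
\mathbb{P}_{S \sim D^{m/2}}[\ell_0(\mathcal{A}(S), D) \ge \tfrac{1}{2}]
&\ge \mathbb{P}_{S \sim D^{m/2}}[\mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S) \ne \emptyset] \\
&= \tfrac{1}{2}\left(\mathbb{P}_{S \sim D^{m/2}}[\mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S) \ne \emptyset]
 + \mathbb{P}_{\tilde{S} \sim D^{m/2}}[\mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, \tilde{S}) \ne \emptyset]\right) \\
&\ge \tfrac{1}{2}\,\mathbb{P}_{S, \tilde{S} \sim D^{m/2}}[\mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, S) \ne \emptyset
 \ \text{OR}\ \mathcal{H}^{\otimes} \cap \mathrm{argmin}_{h \in \mathcal{H}} \ell_\gamma(h, \tilde{S}) \ne \emptyset] \\
&\ge \tfrac{1}{2}\,\mathbb{P}_{S, \tilde{S} \sim D^{m/2}}[S_X \cup \tilde{S}_X \text{ is } \gamma\text{-shattered at the origin}].
\end{align*}

The last inequality follows from the argument above regarding $h_1$ and $h_2$. The last expression is
simply half the probability that a sample of size $m$ from $D_X$ is shattered. By assumption, this
probability is at least $\eta$. Thus we conclude that
$\mathbb{P}_{S \sim D^{m/2}}[\ell_0(\mathcal{A}(S), D) \ge \frac{1}{2}] \ge \eta/2$. It follows that
$\ell_0(\mathcal{A}, D, m/2, \eta/2) \ge \frac{1}{2}$. ■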
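
The key step of the proof, that a hypothesis and its negation cannot both have error below one half, can be checked on a toy sample. The sketch below is only an illustration: the predictor `h1`, the sample, and the convention that a zero prediction counts as an error are all assumptions made for this example.

```python
def zero_one_error(h, sample):
    """Fraction of pairs (x, y) with y * h(x) <= 0 (zero predictions count as errors)."""
    return sum(1 for x, y in sample if y * h(x) <= 0) / len(sample)


def h1(x):
    # Hypothetical homogeneous linear predictor <w, x> with w = (2, -1).
    return 2.0 * x[0] - 1.0 * x[1]


def h2(x):
    # The negated hypothesis, which belongs to any symmetric class containing h1.
    return -h1(x)


sample = [((1.0, 0.5), 1), ((0.2, 1.0), 1), ((-1.0, 0.3), -1), ((0.5, 1.0), -1)]
err1 = zero_one_error(h1, sample)
err2 = zero_one_error(h2, sample)
# On every point at least one of h1, h2 errs, so the two errors sum to at least 1.
worst = max(err1, err2)
```

Whatever the sample, `err1 + err2` is at least 1, so `worst` is at least 1/2; this is the property the adversarial MEM algorithm in the proof exploits.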

As a side note, it is interesting to observe that Theorem 13 does not hold in general for
non-symmetric hypothesis classes. For example, assume that the domain is $\mathcal{X} = [0, 1]$, and the
hypothesis class is the set of all functions that label a finite number of points in $[0, 1]$ by $+1$
and the rest by $-1$. Consider learning using MEM, when the distribution is uniform over $[0, 1]$, and
all the labels are $-1$. For any $m > 0$ and $\gamma \in (0, 1)$, a sample of size $m$ is
$\gamma$-shattered at the origin with probability 1. However, any learning algorithm that returns a
hypothesis from the hypothesis class will incur zero error on this distribution. Thus, shattering alone
does not suffice to ensure that learning is hard.

7.2 A sample complexity lower bound with Gram-matrix eigenvalues 

We now return to the case of homogeneous linear classifiers, and link high-probability fat-shattering 
to properties of the distribution. First, we present an equivalent and simpler characterization of fat- 
shattering for linear classifiers. We then use it to provide a sufficient condition for the fat-shattering 
of a sample, based on the smallest eigenvalue of its Gram matrix. 

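Theorem 14 Let $\mathbb{X} \in \mathbb{R}^{m \times d}$ be the matrix of a set of size $m$ in $\mathbb{R}^d$.
The set is $\gamma$-shattered at the origin by $\mathcal{W}$ if and only if $\mathbb{X}\mathbb{X}^T$ is
invertible and for all $y \in \{\pm 1\}^m$, $y^T(\mathbb{X}\mathbb{X}^T)^{-1}y \le \gamma^{-2}$.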

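To prove Theorem 14 we require two auxiliary lemmas. The first lemma, stated below, shows that
for convex function classes, $\gamma$-shattering can be substituted with shattering with exact
$\gamma$-margins.

Lemma 15 Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ be a class of functions, and assume that
$\mathcal{F}$ is convex, that is

\[ \forall f_1, f_2 \in \mathcal{F},\ \forall \lambda \in [0, 1],\quad \lambda f_1 + (1 - \lambda) f_2 \in \mathcal{F}. \]

If $S = \{x_1, \ldots, x_m\} \subseteq \mathcal{X}$ is $\gamma$-shattered by $\mathcal{F}$ with witness
$r \in \mathbb{R}^m$, then for every $y \in \{\pm 1\}^m$ there is an $f \in \mathcal{F}$ such that for all
$i \in [m]$, $y[i](f(x_i) - r[i]) = \gamma$.

The proof of this lemma is provided in Appendix A.4. The second lemma that we use allows
converting the representation of the Gram matrix to a different feature space, while keeping the
separation properties intact. For a matrix $\mathbb{M}$, denote its pseudo-inverse by $\mathbb{M}^+$.

Lemma 16 Let $\mathbb{X} \in \mathbb{R}^{m \times d}$ be a matrix such that $\mathbb{X}\mathbb{X}^T$ is
invertible, and let $\mathbb{Y} \in \mathbb{R}^{m \times k}$ such that
$\mathbb{X}\mathbb{X}^T = \mathbb{Y}\mathbb{Y}^T$. Let $r \in \mathbb{R}^m$ be some real vector. If there
exists a vector $\tilde{w} \in \mathbb{R}^k$ such that $\mathbb{Y}\tilde{w} = r$, then there exists a vector
$w \in \mathbb{R}^d$ such that $\mathbb{X}w = r$ and
$\|w\| = \|\mathbb{Y}^T(\mathbb{Y}^T)^+\tilde{w}\| \le \|\tilde{w}\|$.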

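Proof Denote $K = \mathbb{X}\mathbb{X}^T = \mathbb{Y}\mathbb{Y}^T$. Let $S = \mathbb{Y}^T K^{-1}\mathbb{X}$
and let $w = S^T\tilde{w}$. We have
$\mathbb{X}w = \mathbb{X}S^T\tilde{w} = \mathbb{X}\mathbb{X}^T K^{-1}\mathbb{Y}\tilde{w} = \mathbb{Y}\tilde{w} = r$.
In addition, $\|w\|^2 = w^T w = \tilde{w}^T S S^T \tilde{w}$. By definition of $S$,

\[ S S^T = \mathbb{Y}^T K^{-1}\mathbb{X}\mathbb{X}^T K^{-1}\mathbb{Y} = \mathbb{Y}^T K^{-1}\mathbb{Y}
 = \mathbb{Y}^T(\mathbb{Y}\mathbb{Y}^T)^{-1}\mathbb{Y} = \mathbb{Y}^T(\mathbb{Y}^T)^+. \]

Denote $O = \mathbb{Y}^T(\mathbb{Y}^T)^+$. $O$ is an orthogonal projection matrix: by the properties of the
pseudo-inverse, $O = O^T$ and $O^2 = O$. Therefore
$\|w\|^2 = \tilde{w}^T S S^T \tilde{w} = \tilde{w}^T O \tilde{w} = \tilde{w}^T O O^T \tilde{w} = \|O\tilde{w}\|^2 \le \|\tilde{w}\|^2$. ■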



Proof [of Theorem 14] We prove the theorem for 1-shattering. The case of $\gamma$-shattering follows
by rescaling $\mathbb{X}$ appropriately. Let $\mathbb{X}\mathbb{X}^T = U \Lambda U^T$ be the SVD of
$\mathbb{X}\mathbb{X}^T$, where $U$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix. Let
$\mathbb{Y} = U\Lambda^{1/2}$. We have $\mathbb{X}\mathbb{X}^T = \mathbb{Y}\mathbb{Y}^T$. We show that the
specified conditions are sufficient and necessary for the shattering of the set.

Sufficient: If $\mathbb{X}\mathbb{X}^T$ is invertible, then $\Lambda$ is invertible, thus so is
$\mathbb{Y}$. For any $y \in \{\pm 1\}^m$, let $\tilde{w}_y = \mathbb{Y}^{-1}y$. Then
$\mathbb{Y}\tilde{w}_y = y$. By Lemma 16, there exists a separator $w$ such that $\mathbb{X}w = y$ and

\[ \|w\| \le \|\tilde{w}_y\| = \sqrt{y^T(\mathbb{Y}\mathbb{Y}^T)^{-1}y} = \sqrt{y^T(\mathbb{X}\mathbb{X}^T)^{-1}y} \le 1. \]

Necessary: If $\mathbb{X}\mathbb{X}^T$ is not invertible then the vectors in $S$ are linearly dependent,
thus $S$ cannot be shattered using linear separators (see e.g. Vapnik, 1995). The first condition is
therefore necessary. Assume $S$ is 1-shattered at the origin and show that the second condition
necessarily holds. By Lemma 15, for all $y \in \{\pm 1\}^m$ there exists a $w_y$ such that
$\mathbb{X}w_y = y$. Thus by Lemma 16 there exists a $\tilde{w}_y$ such that $\mathbb{Y}\tilde{w}_y = y$
and $\|\tilde{w}_y\| \le \|w_y\| \le 1$. $\mathbb{X}\mathbb{X}^T$ is invertible, thus so is $\mathbb{Y}$.
Therefore $\tilde{w}_y = \mathbb{Y}^{-1}y$. Thus
$y^T(\mathbb{X}\mathbb{X}^T)^{-1}y = y^T(\mathbb{Y}\mathbb{Y}^T)^{-1}y = \|\tilde{w}_y\|^2 \le 1$. ■
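
The characterization in Theorem 14 can be checked by brute force on tiny sets. The following sketch is an illustration only: it enumerates all sign vectors, so it is exponential in $m$, and the helper names `solve` and `is_gamma_shattered` are hypothetical. It tests whether the Gram matrix is invertible and whether $y^T(\mathbb{X}\mathbb{X}^T)^{-1}y \le \gamma^{-2}$ for every $y \in \{\pm 1\}^m$.

```python
from itertools import product


def solve(K, y):
    """Solve K z = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(K)
    A = [[float(v) for v in K[i]] + [float(y[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        if abs(A[p][c]) < 1e-12:
            raise ValueError("Gram matrix is singular")
        A[c], A[p] = A[p], A[c]
        for r in range(n):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][n] / A[i][i] for i in range(n)]


def is_gamma_shattered(X, gamma):
    """Check the two conditions of Theorem 14 for the rows of X."""
    m = len(X)
    K = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(m)] for i in range(m)]
    try:
        for y in product((-1, 1), repeat=m):
            z = solve(K, y)  # z = K^{-1} y
            if sum(yi * zi for yi, zi in zip(y, z)) > gamma ** -2 + 1e-12:
                return False
    except ValueError:
        return False  # linearly dependent rows cannot be shattered
    return True


scaled_basis = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
shattered_at_1 = is_gamma_shattered(scaled_basis, 1.0)
shattered_at_2 = is_gamma_shattered(scaled_basis, 2.0)
```

For the scaled basis, $y^T(\mathbb{X}\mathbb{X}^T)^{-1}y = 3/9$ for every $y$, so the set is shattered with margin 1 but not with margin 2.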



We are now ready to provide a sufficient condition for fat-shattering based on the smallest 
eigenvalue of the Gram matrix. 

Corollary 17 Let $\mathbb{X} \in \mathbb{R}^{m \times d}$ be the matrix of a set of size $m$ in
$\mathbb{R}^d$. If $\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m\gamma^2$ then the set is
$\gamma$-shattered at the origin by $\mathcal{W}$.

Proof If $\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m\gamma^2$ then $\mathbb{X}\mathbb{X}^T$ is invertible
and $\lambda_{\max}((\mathbb{X}\mathbb{X}^T)^{-1}) \le (m\gamma^2)^{-1}$. For any $y \in \{\pm 1\}^m$ we have
$\|y\| = \sqrt{m}$ and

\[ y^T(\mathbb{X}\mathbb{X}^T)^{-1}y \le \|y\|^2 \lambda_{\max}((\mathbb{X}\mathbb{X}^T)^{-1}) \le m(m\gamma^2)^{-1} = \gamma^{-2}. \]

By Theorem 14 the sample is $\gamma$-shattered at the origin. ■

Cor. 17 generalizes the requirement of linear independence for shattering with no margin: A set
of vectors is shattered with no margin if the vectors are linearly independent, that is if
$\lambda_{\min} > 0$. The corollary shows that for $\gamma$-fat-shattering, we can require instead
$\lambda_{\min} \ge m\gamma^2$. We can now conclude that if it is highly probable that the smallest
eigenvalue of the sample Gram matrix is large, then MEM might fail to learn a good classifier for the
given distribution. This is formulated in the following theorem.

Theorem 18 Let $D$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$. Let $m > 0$ and let
$\mathbb{X}$ be the matrix of a sample drawn from $D_X^m$. Let
$\eta = \mathbb{P}[\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m\gamma^2]$. Then for all
$\epsilon < 1/2 - \ell^*_\gamma(D)$,

\[ m(\epsilon, \gamma, D, \eta/2) \ge \lfloor m/2 \rfloor. \]

The proof of the theorem is immediate by combining Theorem 13 and Cor. 17.

Theorem 18 generalizes the case of learning a linear separator without a margin: If a sample of
size $m$ is linearly independent with high probability, then there is no hope of using $m/2$ points to
predict the label of the other points. The theorem extends this observation to the case of learning
with a margin, by requiring a stronger condition than just linear independence of the points in the
sample.

Recall that our upper bound on the sample complexity from Section 6 is $\tilde{O}(k_\gamma)$. We now
define the family of sub-Gaussian product distributions, and show that for this family, the lower bound
that can be deduced from Theorem 18 is also linear in $k_\gamma$.

7.3 Sub-Gaussian distributions 

In order to derive a lower bound on distribution-specific sample complexity in terms of the co- 
variance of X ~ Dx, we must assume that X is not too heavy-tailed. This is because for any 
data distribution there exists another distribution which is almost identical and has the same sample 
complexity, but has arbitrarily large covariance values. This can be achieved by mixing the original 
distribution with a tiny probability for drawing a vector with a huge norm. We thus restrict the 
discussion to multidimensional sub-Gaussian distributions. This ensures light tails of the
distribution in all directions, while still allowing a rich family of distributions, as we presently
see. Sub-Gaussianity is defined for scalar random variables as follows (see e.g. Buldygin and
Kozachenko, 1998).

Definition 19 (Sub-Gaussian random variables) A random variable $X \in \mathbb{R}$ is sub-Gaussian with
moment $B$, for $B \ge 0$, if

\[ \forall t \in \mathbb{R},\quad \mathbb{E}[\exp(tX)] \le \exp(t^2 B^2/2). \]

In this work we further say that $X$ is sub-Gaussian with relative moment $\rho > 0$ if $X$ is
sub-Gaussian with moment $\rho\sqrt{\mathbb{E}[X^2]}$, i.e.

\[ \forall t \in \mathbb{R},\quad \mathbb{E}[\exp(tX)] \le \exp(t^2\rho^2\mathbb{E}[X^2]/2). \]

Note that a sub-Gaussian variable with moment $B$ and relative moment $\rho$ is also sub-Gaussian with
moment $B'$ and relative moment $\rho'$ for any $B' \ge B$ and $\rho' \ge \rho$.

The family of sub-Gaussian distributions is quite extensive: For instance, it includes any
bounded, Gaussian, or Gaussian-mixture random variable with mean zero. Specifically, if $X$ is
a mean-zero Gaussian random variable, $X \sim \mathcal{N}(0, \sigma^2)$, then $X$ is sub-Gaussian with
relative moment 1 and the inequalities in the definition above hold with equality. As another example,
if $X$ is a uniform random variable over $\{\pm b\}$ for some $b \ge 0$, then $X$ is sub-Gaussian with
relative moment 1, since

\[ \mathbb{E}[\exp(tX)] = \tfrac{1}{2}(\exp(tb) + \exp(-tb)) \le \exp(t^2 b^2/2) = \exp(t^2\mathbb{E}[X^2]/2). \tag{9} \]
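
The inequality in Eq. (9) is the standard bound $\cosh(x) \le \exp(x^2/2)$. A quick numerical sanity check, under the assumption of an arbitrary value $b = 2$ and a finite grid of $t$ values (both chosen only for illustration), could look as follows.

```python
import math


def mgf_two_point(t, b):
    """Moment-generating function of a uniform variable over {-b, +b}."""
    return 0.5 * (math.exp(t * b) + math.exp(-t * b))


def subgaussian_bound(t, second_moment, rho=1.0):
    """Right-hand side exp(t^2 * rho^2 * E[X^2] / 2) of the relative-moment definition."""
    return math.exp(t * t * rho * rho * second_moment / 2.0)


b = 2.0
grid = [i / 10.0 for i in range(-50, 51)]
# E[X^2] = b^2 for the two-point variable; a tiny slack absorbs rounding at t = 0.
holds = all(
    mgf_two_point(t, b) <= subgaussian_bound(t, b * b) * (1.0 + 1e-12) for t in grid
)
```

A grid check of this kind does not prove the inequality; it only confirms that relative moment 1 is consistent with the two-point distribution on the sampled values.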

Let $\mathbb{B} \in \mathbb{R}^{d \times d}$ be a symmetric PSD matrix. A random vector
$X \in \mathbb{R}^d$ is a sub-Gaussian random vector with moment matrix $\mathbb{B}$ if for all
$u \in \mathbb{R}^d$, $\mathbb{E}[\exp(\langle u, X \rangle)] \le \exp(\langle \mathbb{B}u, u \rangle/2)$.
The following lemma provides a useful connection between the trace of the sub-Gaussian moment matrix
and the moment-generating function of the squared norm of the random vector. The proof is given in
Appendix A.5.

Lemma 20 Let $X \in \mathbb{R}^d$ be a sub-Gaussian random vector with moment matrix $\mathbb{B}$. Then
for all $t \in (0, \frac{1}{4\lambda_{\max}(\mathbb{B})}]$,
$\mathbb{E}[\exp(t\|X\|^2)] \le \exp(2t \cdot \mathrm{trace}(\mathbb{B}))$.

Our lower bound holds for the family of sub-Gaussian product distributions, defined as follows.

Definition 21 (Sub-Gaussian product distributions) A distribution $D_X$ over $\mathbb{R}^d$ is a
sub-Gaussian product distribution with moment $B$ and relative moment $\rho$ if there exists some
orthonormal basis $a_1, \ldots, a_d \in \mathbb{R}^d$, such that for $X \sim D_X$, the
$\langle a_i, X \rangle$ are independent sub-Gaussian random variables, each with moment $B$ and
relative moment $\rho$.

Note that a sub-Gaussian product distribution has mean zero, thus its covariance matrix is equal to
its uncentered covariance matrix. For any fixed $\rho \ge 0$, we denote by
$\mathcal{D}^{\mathrm{sg}}_\rho$ the family of all sub-Gaussian product distributions with relative
moment $\rho$, in arbitrary dimension. For instance, all multivariate Gaussian distributions and all
uniform distributions on the corners of a centered hyper-rectangle are in $\mathcal{D}^{\mathrm{sg}}_1$.
All uniform distributions over a full centered hyper-rectangle are in $\mathcal{D}^{\mathrm{sg}}_{3/2}$.
Note that if $\rho_1 \le \rho_2$, then
$\mathcal{D}^{\mathrm{sg}}_{\rho_1} \subseteq \mathcal{D}^{\mathrm{sg}}_{\rho_2}$.

We will provide a lower bound for all distributions in $\mathcal{D}^{\mathrm{sg}}_\rho$. This lower
bound is linear in the margin-adapted dimension of the distribution, thus it matches the upper bound
provided in Cor. 12. The constants in the lower bound depend only on the value of $\rho$, which we
regard as a constant.



7.4 A sample-complexity lower bound for sub-Gaussian product distributions 



As shown in Section 7.2, to obtain a sample complexity lower bound it suffices to have a lower 
bound on the value of the smallest eigenvalue of a random Gram matrix. The distribution of the 
smallest eigenvalue of a random Gram matrix has been investigated under various assumptions. 
The cleanest results are in the asymptotic case where the sample size and the dimension approach 
infinity, the ratio between them approaches a constant, and the coordinates of each example are 
identically distributed. 



Theorem 22 (Bai and Silverstein, 2010, Theorem 5.11) Let $\{\mathbb{X}_i\}_{i=1}^{\infty}$ be a series
of matrices of sizes $m_i \times d_i$, whose entries are i.i.d. random variables with mean zero,
variance $\sigma^2$ and finite fourth moments. If $\lim_{i \to \infty} \frac{m_i}{d_i} = \beta < 1$, then
$\lim_{i \to \infty} \lambda_{\min}(\frac{1}{d_i}\mathbb{X}_i\mathbb{X}_i^T) = \sigma^2(1 - \sqrt{\beta})^2$.

This asymptotic limit can be used to approximate an asymptotic lower bound on $m(\epsilon, \gamma, D)$,
if $D_X$ is a product distribution of i.i.d. random variables with mean zero, variance $\sigma^2$, and
finite fourth moment. Let $\mathbb{X} \in \mathbb{R}^{m \times d}$ be the matrix of a sample of size $m$
drawn from $D_X$. We can find $m = m_0$ such that
$\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \approx \gamma^2 m_0$, and use Theorem 18 to conclude that
$m(\epsilon, \gamma, D) \ge m_0/2$. If $d$ and $m$ are large enough, we have by Theorem 22 that for
$\mathbb{X}$ drawn from $D_X^m$:

\[ \lambda_{\min}(\mathbb{X}\mathbb{X}^T) \approx d\sigma^2(1 - \sqrt{m/d})^2 = \sigma^2(\sqrt{d} - \sqrt{m})^2. \]

Solving the equality $\sigma^2(\sqrt{d} - \sqrt{m_0})^2 = m_0\gamma^2$ we get
$m_0 = d/(1 + \gamma/\sigma)^2$. The margin-adapted dimension for $D_X$ is
$k_\gamma \approx d/(1 + \gamma^2/\sigma^2)$, thus $\frac{1}{2}k_\gamma \le m_0 \le k_\gamma$. In this
case, then, the sample complexity lower bound is indeed the same order as $k_\gamma$, which controls
also the upper bound in Cor. 12. However, this is an asymptotic analysis, which holds for a highly
limited set of distributions. Moreover, since Theorem 22 holds asymptotically for each distribution
separately, we cannot use it to deduce a uniform finite-sample lower bound for families of distributions.
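
The asymptotic calculation above can be verified numerically. The sketch below, with hypothetical function names and arbitrary values of $d$, $\gamma$ and $\sigma$, computes $m_0 = d/(1+\gamma/\sigma)^2$, checks that it solves $\sigma^2(\sqrt{d}-\sqrt{m_0})^2 = m_0\gamma^2$, and compares it with the approximation $k_\gamma \approx d/(1+\gamma^2/\sigma^2)$.

```python
import math


def asymptotic_m0(d, gamma, sigma):
    """Solution m0 of sigma^2 * (sqrt(d) - sqrt(m0))^2 = m0 * gamma^2."""
    return d / (1.0 + gamma / sigma) ** 2


def approx_margin_adapted_dimension(d, gamma, sigma):
    """Approximation k_gamma ~ d / (1 + gamma^2 / sigma^2) for i.i.d. coordinates."""
    return d / (1.0 + gamma ** 2 / sigma ** 2)


d, gamma, sigma = 10000, 2.0, 1.0
m0 = asymptotic_m0(d, gamma, sigma)
k_gamma = approx_margin_adapted_dimension(d, gamma, sigma)
# Residual of the defining equality; should vanish up to rounding.
residual = sigma ** 2 * (math.sqrt(d) - math.sqrt(m0)) ** 2 - m0 * gamma ** 2
```

The sandwich $k_\gamma/2 \le m_0 \le k_\gamma$ follows from $1 + a^2 \le (1 + a)^2 \le 2(1 + a^2)$ with $a = \gamma/\sigma$.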

For our analysis we require finite-sample bounds for the smallest eigenvalue of a random Gram
matrix. Rudelson and Vershynin (2009, 2008) provide such finite-sample lower bounds for
distributions which are products of identically distributed sub-Gaussians. In Theorem 23 below we
provide a new and more general result, which holds for any sub-Gaussian product distribution. The proof
of Theorem 23 is provided in Appendix A.6. Combining Theorem 23 with Theorem 18 above we
prove the lower bound, stated in Theorem 24 below.

Theorem 23 For any $\rho > 0$ and $\delta \in (0, 1)$ there are $\beta > 0$ and $C > 0$ such that the
following holds. For any $D_X \in \mathcal{D}^{\mathrm{sg}}_\rho$ with covariance matrix $\Sigma \le I$,
and for any $m \le \beta \cdot \mathrm{trace}(\Sigma) - C$, if $\mathbb{X}$ is the $m \times d$ matrix of
a sample drawn from $D_X^m$, then

\[ \mathbb{P}[\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m] \ge \delta. \]

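Theorem 24 (Sample complexity lower bound for distributions in $\mathcal{D}^{\mathrm{sg}}_\rho$) For any
$\rho > 0$ there are constants $\beta > 0$, $C \ge 0$ such that for any $D$ with
$D_X \in \mathcal{D}^{\mathrm{sg}}_\rho$, for any $\gamma > 0$ and for any
$\epsilon < \frac{1}{2} - \ell^*_\gamma(D)$,

\[ m(\epsilon, \gamma, D, 1/4) \ge \beta k_\gamma(D_X) - C. \]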

Proof Assume w.l.o.g. that the orthonormal basis $a_1, \ldots, a_d$ of independent sub-Gaussian
directions of $D_X$, defined in Def. 21, is the natural basis $e_1, \ldots, e_d$. Define
$\lambda_i = \mathbb{E}_{X \sim D_X}[X[i]^2]$, and assume w.l.o.g.
$\lambda_1 \ge \ldots \ge \lambda_d > 0$. Let $\mathbb{X}$ be the $m \times d$ matrix of a sample drawn
from $D_X^m$. Fix $\delta \in (0, 1)$, and let $\beta$ and $C$ be the constants for $\rho$ and $\delta$ in
Theorem 23. Throughout this proof we abbreviate $k_\gamma = k_\gamma(D_X)$. Let
$m \le \beta(k_\gamma - 1) - C$. We would like to use Theorem 23 to bound
$\lambda_{\min}(\mathbb{X}\mathbb{X}^T)$ with high probability, so that Theorem 18 can be applied to get
the desired lower bound. However, Theorem 23 holds only if $\Sigma \le I$. Thus we split to two cases:
one in which the dimensionality controls the lower bound, and one in which the norm controls it. The
split is based on the value of $\lambda_{k_\gamma}$.

Case I Assume $\lambda_{k_\gamma} \ge \gamma^2$. Then $\forall i \in [k_\gamma]$,
$\lambda_i \ge \gamma^2$. By our assumptions on $D_X$, for all $i \in [d]$ the random variable $X[i]$ is
sub-Gaussian with relative moment $\rho$. Consider the random variables
$Z[i] = X[i]/\sqrt{\lambda_i}$ for $i \in [k_\gamma]$. $Z[i]$ is also sub-Gaussian with relative moment
$\rho$, and $\mathbb{E}[Z[i]^2] = 1$. Consider the product distribution of
$Z[1], \ldots, Z[k_\gamma]$, and let $\Sigma'$ be its covariance matrix. We have
$\Sigma' = I_{k_\gamma}$, and $\mathrm{trace}(\Sigma') = k_\gamma$. Let $\mathbb{Z}$ be the matrix of a
sample of size $m$ drawn from this distribution. By Theorem 23,
$\mathbb{P}[\lambda_{\min}(\mathbb{Z}\mathbb{Z}^T) \ge m] \ge \delta$, which is equivalent to

\[ \mathbb{P}[\lambda_{\min}(\mathbb{X} \cdot \mathrm{diag}(1/\lambda_1, \ldots, 1/\lambda_{k_\gamma}, 0, \ldots, 0) \cdot \mathbb{X}^T) \ge m] \ge \delta. \]

Since $\forall i \in [k_\gamma]$, $\lambda_i \ge \gamma^2$, we have
$\mathbb{P}[\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m\gamma^2] \ge \delta$.

Case II Assume $\lambda_{k_\gamma} < \gamma^2$. Then $\lambda_i < \gamma^2$ for all
$i \in \{k_\gamma, \ldots, d\}$. Consider the random variables $Z[i] = X[i]/\gamma$ for
$i \in \{k_\gamma, \ldots, d\}$. $Z[i]$ is sub-Gaussian with relative moment $\rho$ and
$\mathbb{E}[Z[i]^2] < 1$. Consider the product distribution of $Z[k_\gamma], \ldots, Z[d]$, and let
$\Sigma'$ be its covariance matrix. We have $\Sigma' \le I_{d - k_\gamma + 1}$. By the minimality of
$k_\gamma$ in its defining equation we also have
$\mathrm{trace}(\Sigma') = \frac{1}{\gamma^2}\sum_{i=k_\gamma}^{d} \lambda_i \ge k_\gamma - 1$.
Let $\mathbb{Z}$ be the matrix of a sample of size $m$ drawn from this product distribution. By
Theorem 23, $\mathbb{P}[\lambda_{\min}(\mathbb{Z}\mathbb{Z}^T) \ge m] \ge \delta$. Equivalently,

\[ \mathbb{P}[\lambda_{\min}(\mathbb{X} \cdot \mathrm{diag}(0, \ldots, 0, 1/\gamma^2, \ldots, 1/\gamma^2) \cdot \mathbb{X}^T) \ge m] \ge \delta, \]

therefore $\mathbb{P}[\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m\gamma^2] \ge \delta$.

In both cases $\mathbb{P}[\lambda_{\min}(\mathbb{X}\mathbb{X}^T) \ge m\gamma^2] \ge \delta$. This holds for
any $m \le \beta(k_\gamma - 1) - C$, thus by Theorem 18
$m(\epsilon, \gamma, D, \delta/2) \ge \lfloor(\beta(k_\gamma - 1) - C)/2\rfloor$ for
$\epsilon < 1/2 - \ell^*_\gamma(D)$. We finalize the proof by setting $\delta = \frac{1}{2}$ and adjusting
$\beta$ and $C$. ■

8. On the limitations of the covariance matrix 

We have shown matching upper and lower bounds for the sample complexity of learning with MEM, 
for any sub-Gaussian product distribution with a bounded relative moment. This shows that the 
margin-adapted dimension fully characterizes the sample complexity of learning with MEM for 
such distributions. What properties of a distribution play a role in determining the sample complex- 
ity for general distributions? In the following theorem we show that these properties must include 
more than the covariance matrix of the distribution, even when assuming sub-Gaussian tails and 
bounded relative moments. 

Theorem 25 For any integer $d > 1$, there exist two distributions $D$ and $P$ over
$\mathbb{R}^d \times \{\pm 1\}$ with identical covariance matrices, such that for any
$\epsilon, \delta \in (0, \frac{1}{4})$, $m(\epsilon, 1, P, \delta) \ge \Omega(d)$ while
$m(\epsilon, 1, D, \delta) \le \lceil \log_2(1/\delta) \rceil$. Both $D_X$ and $P_X$ are sub-Gaussian
random vectors, with a relative moment of $\sqrt{2}$ in all directions.

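Proof Let $D_a$ and $D_b$ be distributions over $\mathbb{R}^d$ such that $D_a$ is uniform over
$\{\pm 1\}^d$ and $D_b$ is uniform over $\{\pm 1\} \times \{0\}^{d-1}$. Let $D_X$ be a balanced mixture
of $D_a$ and $D_b$. Let $P_X$ be uniform over $\{\pm 1\} \times \{\pm\frac{1}{\sqrt{2}}\}^{d-1}$. For both
$D$ and $P$, let $\mathbb{P}[Y = \langle e_1, X \rangle] = 1$. The covariance matrix of $D_X$ and $P_X$
is $\mathrm{diag}(1, \frac{1}{2}, \ldots, \frac{1}{2})$, thus $k_1(D_X) = k_1(P_X) \ge \Omega(d)$.

By Eq. (9), $P_X$, $D_a$ and $D_b$ are all sub-Gaussian product distributions with relative moment 1,
thus also with relative moment $\sqrt{2} > 1$. The projection of $D_X$ along any direction
$u \in \mathbb{R}^d$ is sub-Gaussian with relative moment $\sqrt{2}$ as well, since

\begin{align*}
\mathbb{E}_{X \sim D_X}[\exp(\langle u, X \rangle)]
&= \tfrac{1}{2}\left(\mathbb{E}_{X \sim D_a}[\exp(\langle u, X \rangle)] + \mathbb{E}_{X \sim D_b}[\exp(\langle u, X \rangle)]\right) \\
&= \tfrac{1}{2}\Big(\prod_{i \in [d]} (\exp(u_i) + \exp(-u_i))/2 + (\exp(u_1) + \exp(-u_1))/2\Big) \\
&\le \tfrac{1}{2}\Big(\prod_{i \in [d]} \exp(u_i^2/2) + \exp(u_1^2/2)\Big) \le \exp(\|u\|^2/2) \le \exp((\|u\|^2 + u_1^2)/2) \\
&= \exp(\mathbb{E}_{X \sim D_X}[\langle u, X \rangle^2]).
\end{align*}

For $P$ we have by Theorem 24 that for any $\epsilon \le \frac{1}{4}$,
$m(\epsilon, 1, P, \frac{1}{4}) \ge \Omega(k_1(P_X)) \ge \Omega(d)$. In contrast, any MEM algorithm
$\mathcal{A}_1$ will output the correct separator for $D$ whenever the sample has at least one point
drawn from $D_b$. This is because the separator $e_1$ is the only $w$ with $\|w\| \le 1$ that classifies
this point with zero 1-margin errors. Such a point exists in a sample of size $m$ with probability
$1 - 2^{-m}$. Therefore $\ell_0(\mathcal{A}_1, D, m, 1/2^m) = 0$. It follows that for all
$\epsilon > 0$, $m(\epsilon, 1, D, \delta) \le \lceil \log_2(1/\delta) \rceil$. ■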
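
For a small dimension, the claim that the mixture $D_X$ and the distribution $P_X$ share the covariance matrix $\mathrm{diag}(1, \frac{1}{2}, \ldots, \frac{1}{2})$ can be confirmed by exact enumeration of their supports. The sketch below is an illustration with hypothetical helper names; it also evaluates the bound $\lceil \log_2(1/\delta) \rceil$ on the number of examples that suffices for $D$.

```python
import math
from itertools import product


def second_moment_matrix(support):
    """Uncentered covariance E[X X^T] of a list of (probability, point) pairs."""
    d = len(support[0][1])
    M = [[0.0] * d for _ in range(d)]
    for p, x in support:
        for i in range(d):
            for j in range(d):
                M[i][j] += p * x[i] * x[j]
    return M


def build_supports(d):
    """Supports of D_X (balanced mixture of D_a and D_b) and of P_X."""
    corners = list(product((-1.0, 1.0), repeat=d))
    D_a = [(1.0 / len(corners), x) for x in corners]
    D_b = [(0.5, (s,) + (0.0,) * (d - 1)) for s in (-1.0, 1.0)]
    D_X = [(0.5 * p, x) for p, x in D_a + D_b]
    r = 1.0 / math.sqrt(2.0)
    P_pts = [(s,) + rest for s in (-1.0, 1.0) for rest in product((-r, r), repeat=d - 1)]
    P_X = [(1.0 / len(P_pts), x) for x in P_pts]
    return D_X, P_X


def examples_sufficient_for_D(delta):
    """Smallest m with 2^{-m} <= delta, i.e. ceil(log2(1/delta))."""
    return math.ceil(math.log2(1.0 / delta))


D_X, P_X = build_supports(4)
M_D = second_moment_matrix(D_X)
M_P = second_moment_matrix(P_X)
m_needed = examples_sufficient_for_D(0.05)
```

Both distributions have mean zero, so the uncentered second-moment matrix computed here coincides with the covariance matrix.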



9. Conclusions 

Cor. 12 and Theorem 24 together provide a tight characterization of the sample complexity of any
sub-Gaussian product distribution with a bounded relative moment. Formally, fix $\rho > 0$. For any $D$
such that $D_X \in \mathcal{D}^{\mathrm{sg}}_\rho$, and for any $\gamma > 0$ and
$\epsilon \in (0, \frac{1}{2} - \ell^*_\gamma(D))$

\[ \Omega(k_\gamma(D_X)) \le m(\epsilon, \gamma, D) \le \tilde{O}\left(\frac{k_\gamma(D_X)}{\epsilon^2}\right). \tag{10} \]

The upper bound holds uniformly for all distributions, and the constants in the lower bound depend
only on $\rho$. This result shows that the true sample complexity of learning each of these
distributions with MEM is characterized by the margin-adapted dimension. An interesting conclusion can
be drawn as to the influence of the conditional distribution of labels $D_{Y|X}$: Since Eq. (10) holds
for any $D_{Y|X}$, the effect of the direction of the best separator on the sample complexity is bounded,
even for highly non-spherical distributions.

We note that the upper bound that we have proved involves logarithmic factors which might not 
be necessary. There are upper bounds that depend on the margin alone and on the dimension alone 
without logarithmic factors. On the other hand, in our bound, which combines the two quantities, 
there is a logarithmic dependence which stems from the margin component of the bound. It might 
be possible to tighten the bound and remove the logarithmic dependence. 

Eq. (10) can be used to easily characterize the sample complexity behavior for interesting
distributions, to compare L2 margin minimization to other learning methods, and to improve certain
active learning strategies, as we henceforth demonstrate.



Gaps between L1 and L2 regularization in the presence of irrelevant features. Ng (2004) considers
learning a single relevant feature in the presence of many irrelevant features, and compares
using L1 regularization and L2 regularization. When $\|X\|_\infty \le 1$, upper bounds on learning with
L1 regularization guarantee a sample complexity of $O(\ln(d))$ for an L1-based learning rule (Zhang,
2002). In order to compare this with the sample complexity of L2 regularized learning and establish
a gap, one must use a lower bound on the L2 sample complexity. The argument provided by Ng
actually assumes scale-invariance of the learning rule, and is therefore valid only for unregularized
linear learning. In contrast, using our results we can easily establish a lower bound of $\Omega(d)$ for
many specific distributions with a bounded $\|X\|_\infty$ and $Y = \mathrm{sign}(X[i])$ for some $i$.
For instance, if each coordinate is a bounded independent sub-Gaussian random variable with a bounded
relative moment, we have $k_1 = \lceil d/2 \rceil$ and Theorem 24 implies a lower bound of $\Omega(d)$
on the L2 sample complexity.

Gaps between generative and discriminative learning for a Gaussian mixture. Consider two
classes, each drawn from a unit-variance spherical Gaussian in $\mathbb{R}^d$ with a large distance
$2v \gg 1$ between the class means, such that $d \gg v^4$. Then
$\mathbb{P}_D[X \mid Y = y] = \mathcal{N}(yv \cdot e_1, I_d)$, where $e_1$ is a unit vector in
$\mathbb{R}^d$. For any $v$ and $d$, we have $D_X \in \mathcal{D}^{\mathrm{sg}}_1$. For large values of
$v$, we have extremely low margin error at $\gamma = v/2$, and so we can hope to learn the classes by
looking for a large-margin separator. Indeed, we can calculate
$k_\gamma = \lceil d/(1 + \frac{v^2}{4}) \rceil$, and conclude that the required sample complexity is
$\tilde{\Theta}(d/v^2)$. Now consider a generative approach: fitting a spherical Gaussian model for
each class. This amounts to estimating each class center as the empirical average of the points in
the class, and classifying based on the nearest estimated class center. It is possible to show that for
any constant $\epsilon > 0$, and for large enough $v$ and $d$, $O(d/v^4)$ samples are enough in order to
ensure an error of $\epsilon$. This establishes a rather large gap of $\Omega(v^2)$ between the sample
complexity of the discriminative approach and that of the generative one.

Active learning. In active learning, there is an abundance of unlabeled examples, but labels are
costly, and the active learning algorithm needs to decide which labels to query based on the labels
seen so far. A popular approach to active learning involves estimating the current set of possible
classifiers using sample complexity upper bounds (see e.g. Balcan et al., 2009; Beygelzimer et al.).
Without any distribution-specific information, only general distribution-free upper bounds
can be used. However, since there is an abundance of unlabeled examples, the active learner can
use these to estimate tighter distribution-specific upper bounds. In the case of linear classifiers, the
margin-adapted dimension can be calculated from the uncentered covariance matrix of the
distribution, which can be easily estimated from unlabeled data. Thus, our sample complexity upper
bounds can be used to improve the active learner's label complexity. Moreover, the lower bound suggests
that any further improvement of such active learning strategies would require more information
other than the distribution's covariance matrix.

To summarize, we have shown that the true sample complexity of large-margin learning of each 
of a rich family of distributions is characterized by the margin-adapted dimension. Characterizing 
the true sample complexity allows a better comparison between this learning approach and other 
algorithms, and has many potential applications. The challenge of characterizing the true sample 
complexity extends to any distribution and any learning approach. Theorem 25 shows that properties
other than the covariance matrix must be taken into account for general distributions. We believe 
that obtaining answers to these questions is of great importance, both to learning theory and to 
learning applications. 

Appendix A. Proofs Omitted from the Text 
A.1 Proof of Prop. 3

Proof Let w* € argmin^gj^d ^^(tt;, D). By Eq. Q, with probability \ — 5/2 



The first inequality follows since the ramp loss is upper bounded by the margin loss. The second 
inequality follows since ^ is a MEM algorithm. Now, by Hoeffding's inequality, since the range of 
ramp is in [0, 1], with probability at least 1 — 5/2 



ramp^(^^(5), D) < ramp^(^^(5'), S) + 27^m(RAMP^, D) + 

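Set $h^* \in \mathcal{H}$ such that $\ell_\gamma(h^*, D) = \ell^*_\gamma(\mathcal{H}, D)$. We have

\[ \mathrm{ramp}_\gamma(\mathcal{A}_\gamma(S), S) \le \ell_\gamma(\mathcal{A}_\gamma(S), S) \le \ell_\gamma(h^*, S). \]

It follows that with probability $1 - \delta$

\[ \mathrm{ramp}_\gamma(\mathcal{A}_\gamma(S), D) \le \ell^*_\gamma(\mathcal{H}, D) + 2\mathcal{R}_m(\mathrm{RAMP}_\gamma, D) + \sqrt{\frac{14\ln(2/\delta)}{m}}. \tag{11} \]

We have $\ell_0 \le \mathrm{ramp}_\gamma$. Combining this with Eq. (11) we conclude the bound stated in
Prop. 3.

A.2 Proof of Lemma 9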

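Proof [of Lemma 9] For a function $f : \mathcal{X} \to \mathbb{R}$ and a $z \in \mathcal{Z}$, define the
function $G[f, z]$ by

\[ \forall x \in \mathcal{X},\quad G[f, z](x) = \llbracket f(x) + z(x) \rrbracket - f(x). \]

Let $f_1, f_2 \in \mathbb{R}^{\mathcal{X}}$ be two functions, and let
$g_1 = G[f_1, z] \in \mathcal{G}(f_1)$ for some $z \in \mathcal{Z}$. Then, since
$G[f_2, z] \in \mathcal{G}(f_2)$, we have

\[ \inf_{g_2 \in \mathcal{G}(f_2)} \|g_1 - g_2\|_{L_2(S)} \le \|G[f_1, z] - G[f_2, z]\|_{L_2(S)}. \]

Now, for all $x \in \mathcal{X}$,

\begin{align*}
|G[f_1, z](x) - G[f_2, z](x)| &= |\llbracket f_1(x) + z(x) \rrbracket - f_1(x) - \llbracket f_2(x) + z(x) \rrbracket + f_2(x)| \\
&\le |f_1(x) - f_2(x)|.
\end{align*}

Thus, for any $S \subseteq \mathcal{X}$,

\begin{align*}
\|G[f_1, z] - G[f_2, z]\|^2_{L_2(S)} &= \mathbb{E}_{X \sim S}(G[f_1, z](X) - G[f_2, z](X))^2 \\
&\le \mathbb{E}_{X \sim S}(f_1(X) - f_2(X))^2 = \|f_1 - f_2\|^2_{L_2(S)}.
\end{align*}

It follows that $\inf_{g_2 \in \mathcal{G}(f_2)} \|g_1 - g_2\|_{L_2(S)} \le \|f_1 - f_2\|_{L_2(S)}$. This
holds for any $g_1 \in \mathcal{G}(f_1)$, thus
$\Delta_H(\mathcal{G}(f_1), \mathcal{G}(f_2)) \le \|f_1 - f_2\|_{L_2(S)}$. ■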
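
The pointwise inequality used in this proof can be sanity-checked numerically. The sketch below assumes that the bracket operator clips its argument to $[0, 1]$, which is consistent with the bounds used in the proof of Lemma 10 but is an assumption here; `clip01` and `G` are hypothetical names, and a grid check is an illustration rather than a proof.

```python
def clip01(a):
    """Clip a real number to the interval [0, 1] (assumed meaning of the bracket)."""
    return min(1.0, max(0.0, a))


def G(f_val, z_val):
    """Value of G[f, z] at a point where f(x) = f_val and z(x) = z_val."""
    return clip01(f_val + z_val) - f_val


# Multiples of 0.25 in [-2, 2] are exactly representable, so no rounding slack is needed.
grid = [i / 4.0 for i in range(-8, 9)]
contraction_holds = all(
    abs(G(a, z) - G(b, z)) <= abs(a - b)
    for a in grid for b in grid for z in grid
)
```

The reason the inequality holds is that clipping is monotone and 1-Lipschitz, so the clipped difference has the same sign as $f_1(x) - f_2(x)$ and no larger magnitude.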



A.3 Proof of Lemma 10

Proof [of Lemma 10] Let $k$ be the pseudo-dimension of $\mathcal{G}(f)$, and let
$\{x_1, \ldots, x_k\} \subseteq \mathcal{X}$ be a set which is pseudo-shattered by $\mathcal{G}(f)$. We
show that the same set is pseudo-shattered by $\mathcal{Z}$ as well, thus proving the lemma. Since
$\mathcal{G}(f)$ is pseudo-shattered, there exists a vector $r \in \mathbb{R}^k$ such that for all
$y \in \{\pm 1\}^k$ there exists a $g_y \in \mathcal{G}(f)$ such that
$\forall i \in [k]$, $\mathrm{sign}(g_y(x_i) - r[i]) = y[i]$. Therefore for all $y \in \{\pm 1\}^k$ there
exists a $z_y \in \mathcal{Z}$ such that

\[ \forall i \in [k],\quad \mathrm{sign}(\llbracket f(x_i) + z_y(x_i) \rrbracket - f(x_i) - r[i]) = y[i]. \]

By considering the case $y[i] = 1$, we have

\[ 0 < \llbracket f(x_i) + z_y(x_i) \rrbracket - f(x_i) - r[i] \le 1 - f(x_i) - r[i]. \]

By considering the case $y[i] = -1$, we have

\[ 0 > \llbracket f(x_i) + z_y(x_i) \rrbracket - f(x_i) - r[i] \ge -f(x_i) - r[i]. \]

Therefore $0 < f(x_i) + r[i] < 1$. Now, let $y \in \{\pm 1\}^k$ and consider any $i \in [k]$. If
$y[i] = 1$ then

\[ \llbracket f(x_i) + z_y(x_i) \rrbracket - f(x_i) - r[i] > 0. \]

It follows that

\[ \llbracket f(x_i) + z_y(x_i) \rrbracket > f(x_i) + r[i] > 0, \]

thus

\[ f(x_i) + z_y(x_i) > f(x_i) + r[i]. \]

In other words, $\mathrm{sign}(z_y(x_i) - r[i]) = 1 = y[i]$. If $y[i] = -1$ then

\[ \llbracket f(x_i) + z_y(x_i) \rrbracket - f(x_i) - r[i] < 0. \]

It follows that

\[ \llbracket f(x_i) + z_y(x_i) \rrbracket < f(x_i) + r[i] < 1, \]

thus

\[ f(x_i) + z_y(x_i) < f(x_i) + r[i]. \]

In other words, $\mathrm{sign}(z_y(x_i) - r[i]) = -1 = y[i]$. We conclude that $\mathcal{Z}$ shatters
$\{x_1, \ldots, x_k\}$ as well, using the same vector $r \in \mathbb{R}^k$. Thus the pseudo-dimension of
$\mathcal{Z}$ is at least $k$. ■

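A.4 Proof of Lemma 15

To prove Lemma 15 we first prove the following lemma. Denote by $\mathrm{conv}(A)$ the convex hull of a
set $A$.

Lemma 26 Let $\gamma > 0$. For each $y \in \{\pm 1\}^m$, select $r_y \in \mathbb{R}^m$ such that for all
$i \in [m]$, $r_y[i]y[i] \ge \gamma$. Let $R = \{r_y \mid y \in \{\pm 1\}^m\}$. Then
$\{\pm\gamma\}^m \subseteq \mathrm{conv}(R)$.

Proof We will prove the claim by induction on the dimension $m$.

Base case For $m = 1$, we have $R = \{a, b\} \subseteq \mathbb{R}$ where $a \le -\gamma$ and
$b \ge \gamma$. Clearly, $\mathrm{conv}(R) = [a, b]$, and $\pm\gamma \in [a, b]$.

Inductive step Assume the lemma holds for $m - 1$. For a vector $t \in \mathbb{R}^m$, denote by
$\bar{t}$ its projection $(t[1], \ldots, t[m-1])$ on $\mathbb{R}^{m-1}$. Similarly, for a set of vectors
$S \subseteq \mathbb{R}^m$, let $\bar{S} = \{\bar{s} \mid s \in S\} \subseteq \mathbb{R}^{m-1}$.
Define $Y_+ = \{\pm 1\}^{m-1} \times \{+1\}$ and $Y_- = \{\pm 1\}^{m-1} \times \{-1\}$. Let
$R_+ = \{r_y \mid y \in Y_+\}$, and similarly for $R_-$. Then the induction hypothesis holds for
$\bar{R}_+$ and $\bar{R}_-$ with dimension $m - 1$. Let $z \in \{\pm\gamma\}^m$. We wish to prove
$z \in \mathrm{conv}(R)$. From the induction hypothesis we have $\bar{z} \in \mathrm{conv}(\bar{R}_+)$ and
$\bar{z} \in \mathrm{conv}(\bar{R}_-)$. Thus, for all $y \in \{\pm 1\}^m$ there exist
$\alpha_y, \beta_y \ge 0$ such that $\sum_{y \in Y_+} \alpha_y = \sum_{y \in Y_-} \beta_y = 1$, and

\[ \bar{z} = \sum_{y \in Y_+} \alpha_y \bar{r}_y = \sum_{y \in Y_-} \beta_y \bar{r}_y. \]

Let $z_a = \sum_{y \in Y_+} \alpha_y r_y$ and $z_b = \sum_{y \in Y_-} \beta_y r_y$. We have that
$\forall y \in Y_+$, $r_y[m] \ge \gamma$, and $\forall y \in Y_-$, $r_y[m] \le -\gamma$. Therefore,

\[ z_b[m] \le -\gamma \le z[m] \le \gamma \le z_a[m]. \]

In addition, $\bar{z}_a = \bar{z}_b = \bar{z}$. Select $\lambda \in [0, 1]$ such that
$z[m] = \lambda z_a[m] + (1 - \lambda) z_b[m]$, then $z = \lambda z_a + (1 - \lambda) z_b$. Since
$z_a, z_b \in \mathrm{conv}(R)$, we have $z \in \mathrm{conv}(R)$. ■

Proof [of Lemma 15] Denote by $f(S)$ the vector $(f(x_1), \ldots, f(x_m))$. Recall that
$r \in \mathbb{R}^m$ is the witness for the shattering of $S$, and let

\[ L = \{f(S) - r \mid f \in \mathcal{F}\} \subseteq \mathbb{R}^m. \]

Since $S$ is shattered, for any $y \in \{\pm 1\}^m$ there is an $r_y \in L$ such that
$\forall i \in [m]$, $r_y[i]y[i] \ge \gamma$. By Lemma 26, $\{\pm\gamma\}^m \subseteq \mathrm{conv}(L)$.
Since $\mathcal{F}$ is convex, $L$ is also convex. Therefore $\{\pm\gamma\}^m \subseteq L$. ■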



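A.5 Proof of Lemma 20

Proof [of Lemma 20] It suffices to consider diagonal moment matrices: If $B$ is not diagonal, let $V \in \mathbb{R}^{d \times d}$ be an orthogonal matrix such that $VBV^T$ is diagonal, and let $Y = VX$. We have $\mathbb{E}[\exp(t\|Y\|^2)] = \mathbb{E}[\exp(t\|X\|^2)]$ and $\mathrm{trace}(VBV^T) = \mathrm{trace}(B)$. In addition, for all $u \in \mathbb{R}^d$,

\[ \mathbb{E}[\exp(\langle u, Y\rangle)] = \mathbb{E}[\exp(\langle V^T u, X\rangle)] \le \exp\big(\tfrac{1}{2}\langle B V^T u, V^T u\rangle\big) = \exp\big(\tfrac{1}{2}\langle V B V^T u, u\rangle\big). \]

Therefore $Y$ is sub-Gaussian with the diagonal moment matrix $VBV^T$. Thus assume w.l.o.g. that $B = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ where $\lambda_1 \ge \ldots \ge \lambda_d \ge 0$.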

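We have $\exp(t\|X\|^2) = \prod_{i \in [d]} \exp(tX[i]^2)$. In addition, for any $t > 0$ and $x \in \mathbb{R}$, $2\sqrt{\pi t} \cdot \exp(tx^2) = \int_{-\infty}^{\infty} \exp\big(sx - \frac{s^2}{4t}\big)\,ds$. Therefore, for any $u \in \mathbb{R}^d$,

\[
\begin{aligned}
(2\sqrt{\pi t})^d \cdot \mathbb{E}[\exp(t\|X\|^2)]
&= \mathbb{E}\Big[\prod_{i \in [d]} \int_{-\infty}^{\infty} \exp\Big(u[i]X[i] - \frac{u[i]^2}{4t}\Big)\,du[i]\Big] \\
&= \mathbb{E}\Big[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i \in [d]} \exp\Big(u[i]X[i] - \frac{u[i]^2}{4t}\Big) \prod_{i \in [d]} du[i]\Big] \\
&= \mathbb{E}\Big[\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\Big(\langle u, X\rangle - \frac{\|u\|^2}{4t}\Big) \prod_{i \in [d]} du[i]\Big] \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \mathbb{E}[\exp(\langle u, X\rangle)] \exp\Big(-\frac{\|u\|^2}{4t}\Big) \prod_{i \in [d]} du[i].
\end{aligned}
\]

By the sub-Gaussianity of $X$, the last expression is bounded by

\[
\begin{aligned}
&\le \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\Big(\frac{1}{2}\langle Bu, u\rangle - \frac{\|u\|^2}{4t}\Big) \prod_{i \in [d]} du[i] \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{i \in [d]} \exp\Big(\frac{\lambda_i u[i]^2}{2} - \frac{u[i]^2}{4t}\Big) \prod_{i \in [d]} du[i] \\
&= \prod_{i \in [d]} \int_{-\infty}^{\infty} \exp\Big(u[i]^2\Big(\frac{\lambda_i}{2} - \frac{1}{4t}\Big)\Big)\,du[i]
= \pi^{d/2} \Big(\prod_{i \in [d]} \Big(\frac{1}{4t} - \frac{\lambda_i}{2}\Big)\Big)^{-\frac{1}{2}}.
\end{aligned}
\]

The last equality follows from the fact that for any $a > 0$, $\int_{-\infty}^{\infty} \exp(-a \cdot s^2)\,ds = \sqrt{\pi/a}$, and from the assumption $t \le \frac{1}{4\lambda_1}$. We conclude that

\[ \mathbb{E}[\exp(t\|X\|^2)] \le \Big(\prod_{i \in [d]} (1 - 2\lambda_i t)\Big)^{-\frac{1}{2}} \le \exp\Big(2t \cdot \sum_{i=1}^{d} \lambda_i\Big) = \exp(2t \cdot \mathrm{trace}(B)), \]

where the second inequality holds since $\forall x \in [0, 1]$, $(1 - x/2)^{-1} \le \exp(x)$. ■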
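As a quick numerical sanity check of the bound in Lemma 20, one can look at the Gaussian case, where everything is available in closed form. The sketch below assumes $X \sim N(0, \mathrm{diag}(\lambda_1, \ldots, \lambda_d))$, which is sub-Gaussian with moment matrix equal to its covariance and satisfies $\mathbb{E}[\exp(t\|X\|^2)] = \prod_i (1 - 2\lambda_i t)^{-1/2}$ for $t < 1/(2\lambda_1)$. The function names and the particular values of $\lambda_i$ are illustrative assumptions.

```python
import math


def gaussian_mgf_sq_norm(lams, t):
    """Closed form of E[exp(t * ||X||^2)] for X ~ N(0, diag(lams)).

    Valid for t < 1 / (2 * max(lams)).
    """
    return math.prod((1.0 - 2.0 * lam * t) ** -0.5 for lam in lams)


def lemma20_bound(lams, t):
    """Right-hand side of Lemma 20: exp(2 * t * trace(B)) with B = diag(lams)."""
    return math.exp(2.0 * t * sum(lams))


# Hypothetical diagonal moment matrix; t ranges over (0, 1 / (4 * lambda_max)].
lams = [2.0, 1.0, 0.5, 0.1]
t_max = 1.0 / (4.0 * max(lams))
checks = [
    (gaussian_mgf_sq_norm(lams, t_max * k / 10), lemma20_bound(lams, t_max * k / 10))
    for k in range(1, 11)
]
```

Each pair in `checks` holds the exact Gaussian value and the bound at the same $t$. The gap is smallest at $t = 1/(4\lambda_1)$, the edge of the range where the inequality $(1 - x/2)^{-1} \le \exp(x)$ is applied.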



A.6 Proof of Theorem 23

In the proof of Theorem 23 we use the fact $\lambda_{\min}(XX^T) = \inf_{\|x\|_2 = 1} \|X^T x\|^2$ and bound the right-hand side via an $\epsilon$-net of the unit sphere in $\mathbb{R}^m$, denoted by $S^{m-1} = \{x \in \mathbb{R}^m \mid \|x\|_2 = 1\}$. An $\epsilon$-net of the unit sphere is a set $C \subseteq S^{m-1}$ such that $\forall x \in S^{m-1}$, $\exists x' \in C$, $\|x - x'\| \le \epsilon$. Denote the minimal size of an $\epsilon$-net for $S^{m-1}$ by $\mathcal{N}_m(\epsilon)$, and by $C_m(\epsilon)$ a minimal $\epsilon$-net of $S^{m-1}$, so that $C_m(\epsilon) \subseteq S^{m-1}$ and $|C_m(\epsilon)| = \mathcal{N}_m(\epsilon)$. The proof of Theorem 23 requires several lemmas. First we prove a concentration result for the norm of a matrix defined by sub-Gaussian variables. Then we bound the probability that the squared norm of a vector is small.

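Lemma 27 Let $Y$ be a $d \times m$ matrix with $m \le d$, such that $Y_{ij}$ are independent sub-Gaussian variables with moment $B$. Let $\Sigma$ be a diagonal $d \times d$ PSD matrix such that $\Sigma \le I$. Then for all $t \ge 0$ and $\epsilon \in (0, 1)$,

\[ \mathbb{P}[\|\sqrt{\Sigma}Y\| \ge t] \le \mathcal{N}_m(\epsilon) \exp\Big(\frac{\mathrm{trace}(\Sigma)}{2} - \frac{t^2(1 - \epsilon)^2}{4B^2}\Big). \]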



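Proof We have $\|\sqrt{\Sigma}Y\| \le \max_{x \in C_m(\epsilon)} \|\sqrt{\Sigma}Yx\| / (1 - \epsilon)$, see for instance Bennett et al. (1975). Therefore,

\[ \mathbb{P}[\|\sqrt{\Sigma}Y\| \ge t] \le \sum_{x \in C_m(\epsilon)} \mathbb{P}[\|\sqrt{\Sigma}Yx\| \ge (1 - \epsilon)t]. \tag{12} \]

Fix $x \in C_m(\epsilon)$. Let $V = \sqrt{\Sigma}Yx$, and assume $\Sigma = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$. For $u \in \mathbb{R}^d$,

\[
\begin{aligned}
\mathbb{E}[\exp(\langle u, V\rangle)]
&= \mathbb{E}\Big[\exp\Big(\sum_{i \in [d]} u_i \sqrt{\lambda_i} \sum_{j \in [m]} Y_{ij} x_j\Big)\Big]
= \prod_{j,i} \mathbb{E}[\exp(u_i \sqrt{\lambda_i}\, Y_{ij} x_j)] \\
&\le \prod_{j,i} \exp(u_i^2 \lambda_i B^2 x_j^2 / 2)
= \exp\Big(\frac{B^2}{2} \sum_{i \in [d]} u_i^2 \lambda_i \sum_{j \in [m]} x_j^2\Big) \\
&= \exp\Big(\frac{B^2}{2} \sum_{i \in [d]} u_i^2 \lambda_i\Big)
= \exp(\langle B^2 \Sigma u, u\rangle / 2).
\end{aligned}
\]

Thus $V$ is a sub-Gaussian vector with moment matrix $B^2\Sigma$. Let $s = 1/(4B^2)$. Since $\Sigma \le I$, we have $s \le 1/(4B^2 \max_{i \in [d]} \lambda_i)$. Therefore, by Lemma 20,

\[ \mathbb{E}[\exp(s\|V\|^2)] \le \exp(2sB^2\,\mathrm{trace}(\Sigma)). \]

By Chernoff's method, $\mathbb{P}[\|V\|^2 \ge z^2] \le \mathbb{E}[\exp(s\|V\|^2)] / \exp(sz^2)$. Thus

\[ \mathbb{P}[\|V\|^2 \ge z^2] \le \exp(2sB^2\,\mathrm{trace}(\Sigma) - sz^2) = \exp\Big(\frac{\mathrm{trace}(\Sigma)}{2} - \frac{z^2}{4B^2}\Big). \]

Set $z = t(1 - \epsilon)$. Then for all $x \in S^{m-1}$

\[ \mathbb{P}[\|\sqrt{\Sigma}Yx\| \ge t(1 - \epsilon)] = \mathbb{P}[\|V\| \ge t(1 - \epsilon)] \le \exp\Big(\frac{\mathrm{trace}(\Sigma)}{2} - \frac{t^2(1 - \epsilon)^2}{4B^2}\Big). \]

Therefore, by Eq. (12),

\[ \mathbb{P}[\|\sqrt{\Sigma}Y\| \ge t] \le \mathcal{N}_m(\epsilon) \exp\Big(\frac{\mathrm{trace}(\Sigma)}{2} - \frac{t^2(1 - \epsilon)^2}{4B^2}\Big). \]
■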
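The first step of the proof of Lemma 27, bounding the operator norm by a maximum over an $\epsilon$-net, can be illustrated in the smallest non-trivial case $m = 2$, where the unit sphere is a circle. The following is a sketch under stated assumptions: the equally spaced net, the helper names, and the matrix `A` are illustrative and do not come from the original text. The net is not claimed to be minimal. The sketch only checks that its size respects the bound $2m(1 + 2/\epsilon)^{m-1}$ used later in the proof, and that $\|A\| \le \max_{x \in C} \|Ax\| / (1 - \epsilon)$.

```python
import math


def circle_net(eps):
    """Equally spaced eps-net of the unit circle S^1 (Euclidean distance)."""
    # Neighbouring points are 2*pi/n apart in angle, so every point of the
    # circle is within chord distance 2*sin(pi/(2n)) <= eps of the net.
    n = math.ceil(math.pi / (2.0 * math.asin(eps / 2.0)))
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


def apply(A, x):
    """Matrix-vector product for a d x 2 matrix given as a list of rows."""
    return [row[0] * x[0] + row[1] * x[1] for row in A]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def operator_norm(A):
    """Largest singular value of a d x 2 matrix via the 2 x 2 matrix A^T A."""
    a = sum(r[0] * r[0] for r in A)
    b = sum(r[0] * r[1] for r in A)
    c = sum(r[1] * r[1] for r in A)
    top_eig = (a + c) / 2.0 + math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return math.sqrt(top_eig)


# Hypothetical 3 x 2 matrix standing in for sqrt(Sigma) * Y.
A = [[3.0, 1.0], [0.0, 2.0], [-1.0, 4.0]]
eps = 0.5
net = circle_net(eps)
net_max = max(norm(apply(A, x)) for x in net)
```

Here `net_max / (1 - eps)` is an upper bound on `operator_norm(A)`, which is why a union bound over the finitely many net points suffices in Eq. (12).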



Lemma 28 Let $Y$ be a $d \times m$ matrix with $m \le d$, such that $Y_{ij}$ are independent centered random variables with variance 1 and fourth moments at most $B$. Let $\Sigma$ be a diagonal $d \times d$ PSD matrix such that $\Sigma \le I$. There exist $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $B$ such that for any $x \in S^{m-1}$

\[ \mathbb{P}[\|\sqrt{\Sigma}Yx\|^2 \le \alpha \cdot (\mathrm{trace}(\Sigma) - 1)] \le \eta^{\mathrm{trace}(\Sigma)}. \]

To prove Lemma 28 we require Lemma 29 (Rudelson and Vershynin, 2008, Lemma 2.2) and Lemma 30, which extends Lemma 2.6 in the same work.

Lemma 29 Let $T_1, \ldots, T_n$ be independent non-negative random variables. Assume that there are $\theta > 0$ and $\mu \in (0, 1)$ such that for any $i$, $\mathbb{P}[T_i \le \theta] \le \mu$. There are $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $\theta$ and $\mu$ such that

\[ \mathbb{P}\Big[\sum_{i=1}^{n} T_i \le \alpha n\Big] \le \eta^n. \]

Lemma 30 Let $Y$ be a $d \times m$ matrix with $m \le d$, such that the columns of $Y$ are i.i.d. random vectors. Assume further that $Y_{ij}$ are centered, and have a variance of 1 and a fourth moment at most $B$. Let $\Sigma$ be a diagonal $d \times d$ PSD matrix. Then for all $x \in S^{m-1}$,

\[ \mathbb{P}[\|\sqrt{\Sigma}Yx\| \le \sqrt{\mathrm{trace}(\Sigma)/2}] \le 1 - 1/(196B). \]

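Proof Let $x \in S^{m-1}$, and $T_i = (\sum_{j=1}^{m} Y_{ij} x_j)^2$. Let $\lambda_1, \ldots, \lambda_d$ be the values on the diagonal of $\Sigma$, and let $T_\Sigma = \|\sqrt{\Sigma}Yx\|^2 = \sum_{i=1}^{d} \lambda_i T_i$. First, since $\mathbb{E}[Y_{ij}] = 0$ and $\mathbb{E}[Y_{ij}^2] = 1$ for all $i, j$, we have

\[ \mathbb{E}[T_i] = \sum_{j \in [m]} x_j^2\, \mathbb{E}[Y_{ij}^2] = \|x\|^2 = 1. \]

Therefore $\mathbb{E}[T_\Sigma] = \mathrm{trace}(\Sigma)$. Second, since $Y_{i1}, \ldots, Y_{im}$ are independent and centered, we have (Ledoux and Talagrand, 1991, Lemma 6.3)

\[ \mathbb{E}[T_i^2] = \mathbb{E}\Big[\Big(\sum_{j \in [m]} Y_{ij} x_j\Big)^4\Big] \le 16\, \mathbb{E}\Big[\Big(\sum_{j \in [m]} \sigma_j Y_{ij} x_j\Big)^4\Big], \]

where $\sigma_1, \ldots, \sigma_m$ are independent uniform $\{\pm 1\}$ variables. Now, by Khinchine's inequality (Nazarov and Podkorytov, 2000),

\[ \mathbb{E}\Big[\Big(\sum_{j \in [m]} \sigma_j Y_{ij} x_j\Big)^4\Big] \le 3\, \mathbb{E}\Big[\Big(\sum_{j \in [m]} Y_{ij}^2 x_j^2\Big)^2\Big] = 3 \sum_{j,k \in [m]} x_j^2 x_k^2\, \mathbb{E}[Y_{ij}^2 Y_{ik}^2]. \]

Now, by the Cauchy-Schwarz inequality, $\mathbb{E}[Y_{ij}^2 Y_{ik}^2] \le \sqrt{\mathbb{E}[Y_{ij}^4]\,\mathbb{E}[Y_{ik}^4]} \le B$. Thus $\mathbb{E}[T_i^2] \le 48B \sum_{j,k \in [m]} x_j^2 x_k^2 = 48B\|x\|^4 = 48B$. Thus,

\[
\begin{aligned}
\mathbb{E}[T_\Sigma^2] &= \mathbb{E}\Big[\Big(\sum_{i=1}^{d} \lambda_i T_i\Big)^2\Big] = \sum_{i,j=1}^{d} \lambda_i \lambda_j\, \mathbb{E}[T_i T_j] \\
&\le \sum_{i,j=1}^{d} \lambda_i \lambda_j \sqrt{\mathbb{E}[T_i^2]\,\mathbb{E}[T_j^2]} \le 48B \Big(\sum_{i=1}^{d} \lambda_i\Big)^2 = 48B \cdot \mathrm{trace}(\Sigma)^2.
\end{aligned}
\]

By the Paley-Zygmund inequality (Paley and Zygmund, 1932), for $\theta \in [0, 1]$

\[ \mathbb{P}[T_\Sigma \ge \theta\, \mathbb{E}[T_\Sigma]] \ge (1 - \theta)^2 \frac{\mathbb{E}[T_\Sigma]^2}{\mathbb{E}[T_\Sigma^2]} \ge \frac{(1 - \theta)^2}{48B}. \]

Therefore, setting $\theta = 1/2$, we get $\mathbb{P}[T_\Sigma \le \mathrm{trace}(\Sigma)/2] \le 1 - 1/(196B)$. ■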
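To see what Lemma 30 asserts on a concrete instance, the probability can be computed exactly for a tiny matrix by enumerating all outcomes. The sketch below assumes i.i.d. Rademacher entries $Y_{ij} \in \{\pm 1\}$, which are centered with variance 1 and fourth moment $B = 1$. The function name, the diagonal of $\Sigma$ and the unit vector $x$ are illustrative assumptions.

```python
from itertools import product


def small_ball_probability(lams, x):
    """Exact P[T_Sigma <= trace(Sigma)/2] for i.i.d. Rademacher entries Y_ij.

    lams is the diagonal of Sigma (length d) and x is a unit vector of
    length m; all 2^(d*m) sign patterns are enumerated.
    """
    d, m = len(lams), len(x)
    threshold = sum(lams) / 2.0
    hits = total = 0
    for signs in product((-1.0, 1.0), repeat=d * m):
        t_sigma = 0.0
        for i in range(d):
            row = signs[i * m:(i + 1) * m]
            # T_i = (sum_j Y_ij x_j)^2, weighted by lambda_i.
            t_sigma += lams[i] * sum(y * xj for y, xj in zip(row, x)) ** 2
        hits += t_sigma <= threshold
        total += 1
    return hits / total


# Hypothetical instance: d = 3, m = 2, ||x|| = 1.
lams = [1.0, 0.5, 0.25]
x = (0.6, 0.8)
p_small = small_ball_probability(lams, x)
lemma30_bound = 1.0 - 1.0 / (196.0 * 1.0)   # B = 1 for Rademacher entries
```

In this instance the exact probability is far below the bound. Lemma 30 only needs the probability to be bounded away from 1 by a constant depending on $B$, which is what Lemma 29 requires of each block.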



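Proof [of Lemma 28] Let $\lambda_i \in [0, 1]$ be the values on the diagonal of $\Sigma$. Consider a partition $Z_1, \ldots, Z_k$ of $[d]$, and denote $L_j = \sum_{i \in Z_j} \lambda_i$. There exists such a partition such that for all $j \in [k]$, $L_j \le 1$, and for all $j \in [k - 1]$, $L_j > \frac{1}{2}$. Let $\Sigma[j]$ be the sub-matrix of $\Sigma$ that includes the rows and columns whose indexes are in $Z_j$. Let $Y[j]$ be the sub-matrix of $Y$ that includes the rows in $Z_j$. Denote $T_j = \|\sqrt{\Sigma[j]}\,Y[j]\,x\|^2$. Then

\[ \|\sqrt{\Sigma}Yx\|^2 = \sum_{j \in [k]} \sum_{i \in Z_j} \lambda_i \Big(\sum_{l=1}^{m} Y_{il} x_l\Big)^2 = \sum_{j \in [k]} T_j. \]

We have $\mathrm{trace}(\Sigma) = \sum_{i=1}^{d} \lambda_i \ge \sum_{j \in [k-1]} L_j \ge \frac{1}{2}(k - 1)$. In addition, $L_j \le 1$ for all $j \in [k]$. Thus $\mathrm{trace}(\Sigma) \le k \le 2\,\mathrm{trace}(\Sigma) + 1$. For all $j \in [k - 1]$, $L_j > \frac{1}{2}$, thus by Lemma 30, $\mathbb{P}[T_j \le 1/4] \le 1 - 1/(196B)$. Therefore, by Lemma 29 there are $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $B$ such that

\[
\begin{aligned}
\mathbb{P}[\|\sqrt{\Sigma}Yx\|^2 \le \alpha \cdot (\mathrm{trace}(\Sigma) - 1)]
&\le \mathbb{P}[\|\sqrt{\Sigma}Yx\|^2 \le \alpha(k - 1)] \\
&= \mathbb{P}\Big[\sum_{j \in [k]} T_j \le \alpha(k - 1)\Big]
\le \mathbb{P}\Big[\sum_{j \in [k-1]} T_j \le \alpha(k - 1)\Big]
\le \eta^{k-1} \le \eta^{2\,\mathrm{trace}(\Sigma)}.
\end{aligned}
\]

The lemma follows by substituting $\eta$ for $\eta^2$. ■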
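The proof of Lemma 28 only asserts that a suitable partition exists. One simple way to build it is sketched below, as an assumption of this illustration rather than the construction intended in the text: every $\lambda_i > \frac{1}{2}$ forms its own block, and the remaining values are accumulated greedily until a block's sum exceeds $\frac{1}{2}$. Since each added value is at most $\frac{1}{2}$, no block sum exceeds 1. The function name and the example values are hypothetical, and rational arithmetic is used so that the threshold comparisons are exact.

```python
from fractions import Fraction


def partition_blocks(lams):
    """Partition indices of lams (values in [0, 1]) into blocks Z_1..Z_k.

    Every block has sum <= 1, and every block except possibly the last
    has sum > 1/2.
    """
    half = Fraction(1, 2)
    # Large entries are blocks on their own: their sum lies in (1/2, 1].
    blocks = [[i] for i, lam in enumerate(lams) if lam > half]
    current, total = [], Fraction(0)
    for i, lam in enumerate(lams):
        if lam > half:
            continue
        current.append(i)
        total += lam
        if total > half:            # previous total <= 1/2 and lam <= 1/2
            blocks.append(current)
            current, total = [], Fraction(0)
    if current:
        blocks.append(current)      # the only block allowed to have sum <= 1/2
    return blocks


# Hypothetical diagonal of Sigma.
lams = [Fraction(9, 10), Fraction(2, 5), Fraction(3, 10), Fraction(1, 5),
        Fraction(1, 2), Fraction(1, 10), Fraction(3, 5), Fraction(1, 20)]
blocks = partition_blocks(lams)
block_sums = [sum(lams[i] for i in block) for block in blocks]
```

With $k$ the number of blocks, the resulting sums satisfy $\mathrm{trace}(\Sigma) \le k \le 2\,\mathrm{trace}(\Sigma) + 1$, which is the only property of the partition used in the proof.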



Proof [of Theorem 23] We have

\[ \sqrt{\lambda_{\min}(XX^T)} = \inf_{x \in S^{m-1}} \|X^T x\| \ge \min_{x \in C_m(\epsilon)} \|X^T x\| - \epsilon\|X^T\|. \tag{13} \]

For brevity, denote $L = \mathrm{trace}(\Sigma)$. Assume $L \ge 2$. Let $m \le L \cdot \min(1, (c - K\epsilon)^2)$ where $c, K, \epsilon$ are constants that will be set later such that $c - K\epsilon > 0$. By Eq. (13)

\[
\begin{aligned}
\mathbb{P}[\lambda_{\min}(XX^T) \le m] &\le \mathbb{P}[\lambda_{\min}(XX^T) \le (c - K\epsilon)^2 L] \\
&\le \mathbb{P}\Big[\min_{x \in C_m(\epsilon)} \|X^T x\| - \epsilon\|X^T\| \le (c - K\epsilon)\sqrt{L}\Big] && (14) \\
&\le \mathbb{P}[\|X^T\| \ge K\sqrt{L}] + \mathbb{P}\Big[\min_{x \in C_m(\epsilon)} \|X^T x\| \le c\sqrt{L}\Big]. && (15)
\end{aligned}
\]

The last inequality holds since the inequality in line (14) implies at least one of the inequalities in line (15). We will now upper-bound each of the terms in line (15). We assume w.l.o.g. that $\Sigma$ is not singular (since zero rows and columns can be removed from $X$ without changing $\lambda_{\min}(XX^T)$). Define $Y = \Sigma^{-1/2} X^T$. Note that $Y_{ij}$ are independent sub-Gaussian variables with (absolute) moment $\rho$. To bound the first term in line (15), note that by Lemma 27, for any $K > 0$,

\[ \mathbb{P}[\|X^T\| \ge K\sqrt{L}] = \mathbb{P}[\|\sqrt{\Sigma}Y\| \ge K\sqrt{L}] \le \mathcal{N}_m(\tfrac{1}{2}) \exp\Big(L\Big(\frac{1}{2} - \frac{K^2}{16\rho^2}\Big)\Big). \]



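By Rudelson and Vershynin (2009), Proposition 2.1, for all $\epsilon \in [0, 1]$, $\mathcal{N}_m(\epsilon) \le 2m(1 + \frac{2}{\epsilon})^{m-1}$. Therefore

\[ \mathbb{P}[\|X^T\| \ge K\sqrt{L}] \le 2m5^{m-1} \exp\Big(L\Big(\frac{1}{2} - \frac{K^2}{16\rho^2}\Big)\Big). \]

Let $K^2 = 16\rho^2(\frac{3}{2} + \ln(5) + \ln(2/\delta))$. Recall that by assumption $m \le L$, and $L \ge 2$. Therefore

\[
\begin{aligned}
\mathbb{P}[\|X^T\| \ge K\sqrt{L}] &\le 2m5^{m-1} \exp(-L(1 + \ln(5) + \ln(2/\delta))) \\
&\le 2L5^{L-1} \exp(-L(1 + \ln(5) + \ln(2/\delta))).
\end{aligned}
\]

Since $L \ge 2$, we have $2L\exp(-L) \le 1$. Therefore

\[ \mathbb{P}[\|X^T\| \ge K\sqrt{L}] \le 2L\exp(-L - \ln(2/\delta)) \le \exp(-\ln(2/\delta)) = \frac{\delta}{2}. \tag{16} \]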
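To bound the second term in line (15), since $Y_{ij}$ are sub-Gaussian with moment $\rho$, $\mathbb{E}[Y_{ij}^4] \le 5\rho^4$ (Buldygin and Kozachenko, 1998, Lemma 1.4). Thus, by Lemma 28, there are $\alpha > 0$ and $\eta \in (0, 1)$ that depend only on $\rho$ such that for all $x \in S^{m-1}$, $\mathbb{P}[\|\sqrt{\Sigma}Yx\|^2 \le \alpha(L - 1)] \le \eta^L$. Set $c = \sqrt{\alpha/2}$. Since $L \ge 2$, we have $c\sqrt{L} \le \sqrt{\alpha(L - 1)}$. Thus

\[
\begin{aligned}
\mathbb{P}\Big[\min_{x \in C_m(\epsilon)} \|X^T x\| \le c\sqrt{L}\Big] &\le \sum_{x \in C_m(\epsilon)} \mathbb{P}[\|X^T x\| \le c\sqrt{L}] \\
&\le \sum_{x \in C_m(\epsilon)} \mathbb{P}\big[\|\sqrt{\Sigma}Yx\| \le \sqrt{\alpha(L - 1)}\big] \le \mathcal{N}_m(\epsilon)\,\eta^L.
\end{aligned}
\]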
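Let $\epsilon = c/(2K)$, so that $c - K\epsilon > 0$. Let $\theta = \min\big(\frac{1}{2}, \frac{\ln(1/\eta)}{2\ln(1 + 2/\epsilon)}\big)$. Set $L_0$ such that $\forall L \ge L_0$, $L \ge \frac{2\ln(2/\delta) + 2\ln(L)}{\ln(1/\eta)}$. For $L \ge L_0$ and $m \le \theta L \le L/2$,

\[
\begin{aligned}
\mathcal{N}_m(\epsilon)\,\eta^L &\le 2m(1 + 2/\epsilon)^{m-1}\eta^L \\
&\le L\exp(L(\theta\ln(1 + 2/\epsilon) - \ln(1/\eta))) \\
&= \exp(\ln(L) + L(\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)/2) - L\ln(1/\eta)/2) \\
&\le \exp(L(\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)/2) + \ln(\delta/2)) && (17) \\
&\le \exp(\ln(\delta/2)) = \frac{\delta}{2}. && (18)
\end{aligned}
\]

Line (17) follows from $L \ge L_0$, and line (18) follows from $\theta\ln(1 + 2/\epsilon) - \ln(1/\eta)/2 \le 0$. Set $\beta = \min\{(c - K\epsilon)^2, 1, \theta\}$. Combining Eq. (15), Eq. (16) and Eq. (18) we have that if $L \ge \bar{L} = \max(L_0, 2)$, then $\mathbb{P}[\lambda_{\min}(XX^T) \le m] \le \delta$ for all $m \le \beta L$. Specifically, this holds for all $L \ge 0$ and for all $m \le \beta(L - \bar{L})$. Letting $C = \beta\bar{L}$ and substituting $\delta$ for $1 - \delta$ we get the statement of the theorem. ■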
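The choice of $K$ in the bound on the first term of line (15) can be checked numerically. The sketch below is an illustration under stated assumptions: the function name and the grid of values for $m$, $L$, $\rho$ and $\delta$ are hypothetical. It evaluates $2m5^{m-1}\exp(L(\frac{1}{2} - \frac{K^2}{16\rho^2}))$ with $K^2 = 16\rho^2(\frac{3}{2} + \ln 5 + \ln(2/\delta))$ and compares it with $\delta/2$ for $m \le L$ and $L \ge 2$.

```python
import math


def first_term_bound(m, L, rho, delta):
    """Bound on P[||X^T|| >= K*sqrt(L)] from Lemma 27 with eps = 1/2.

    Uses N_m(1/2) <= 2*m*5^(m-1) and K^2 = 16*rho^2*(3/2 + ln 5 + ln(2/delta)).
    """
    k_squared = 16.0 * rho ** 2 * (1.5 + math.log(5.0) + math.log(2.0 / delta))
    net_size = 2.0 * m * 5.0 ** (m - 1)
    return net_size * math.exp(L * (0.5 - k_squared / (16.0 * rho ** 2)))


# Hypothetical grid: m ranges over 1..floor(L) for each L >= 2.
grid = [
    (m, L, rho, delta)
    for delta in (0.5, 0.1, 0.01)
    for rho in (0.5, 1.0, 3.0)
    for L in (2.0, 2.5, 10.0, 50.0)
    for m in range(1, int(L) + 1)
]
worst_ratio = max(first_term_bound(m, L, rho, delta) / (delta / 2.0)
                  for m, L, rho, delta in grid)
```

A value of `worst_ratio` below 1 means the bound in Eq. (16) holds at every grid point. Note that $\rho$ cancels inside the exponent, so the bound depends on $\rho$ only through $K$ itself.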